diff --git a/docs/img/webui-structured-streaming-detail.png b/docs/img/webui-structured-streaming-detail.png
new file mode 100644
index 0000000000000..f4850523c5c2f
Binary files /dev/null and b/docs/img/webui-structured-streaming-detail.png differ
diff --git a/docs/sql-ref-datetime-pattern.md b/docs/sql-ref-datetime-pattern.md
index df19b9ce4c082..4275f03335b33 100644
--- a/docs/sql-ref-datetime-pattern.md
+++ b/docs/sql-ref-datetime-pattern.md
@@ -30,25 +30,25 @@ Spark uses pattern letters in the following table for date and timestamp parsing
 |Symbol|Meaning|Presentation|Examples|
 |------|-------|------------|--------|
-|**G**|era|text|AD; Anno Domini; A|
+|**G**|era|text|AD; Anno Domini|
 |**y**|year|year|2020; 20|
-|**D**|day-of-year|number|189|
-|**M/L**|month-of-year|number/text|7; 07; Jul; July; J|
-|**d**|day-of-month|number|28|
+|**D**|day-of-year|number(3)|189|
+|**M/L**|month-of-year|month|7; 07; Jul; July|
+|**d**|day-of-month|number(3)|28|
 |**Q/q**|quarter-of-year|number/text|3; 03; Q3; 3rd quarter|
 |**Y**|week-based-year|year|1996; 96|
-|**w**|week-of-week-based-year|number|27|
-|**W**|week-of-month|number|4|
-|**E**|day-of-week|text|Tue; Tuesday; T|
-|**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday; T|
-|**F**|week-of-month|number|3|
-|**a**|am-pm-of-day|text|PM|
-|**h**|clock-hour-of-am-pm (1-12)|number|12|
-|**K**|hour-of-am-pm (0-11)|number|0|
-|**k**|clock-hour-of-day (1-24)|number|0|
-|**H**|hour-of-day (0-23)|number|0|
-|**m**|minute-of-hour|number|30|
-|**s**|second-of-minute|number|55|
+|**w**|week-of-week-based-year|number(2)|27|
+|**W**|week-of-month|number(1)|4|
+|**E**|day-of-week|text|Tue; Tuesday|
+|**u**|localized day-of-week|number/text|2; 02; Tue; Tuesday|
+|**F**|week-of-month|number(1)|3|
+|**a**|am-pm-of-day|am-pm|PM|
+|**h**|clock-hour-of-am-pm (1-12)|number(2)|12|
+|**K**|hour-of-am-pm (0-11)|number(2)|0|
+|**k**|clock-hour-of-day (1-24)|number(2)|0|
+|**H**|hour-of-day (0-23)|number(2)|0|
+|**m**|minute-of-hour|number(2)|30|
+|**s**|second-of-minute|number(2)|55|
 |**S**|fraction-of-second|fraction|978|
 |**V**|time-zone ID|zone-id|America/Los_Angeles; Z; -08:30|
 |**z**|time-zone name|zone-name|Pacific Standard Time; PST|
@@ -63,9 +63,9 @@ Spark uses pattern letters in the following table for date and timestamp parsing
 
 The count of pattern letters determines the format.
 
-- Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. Exactly 5 pattern letters will use the narrow form. Six or more letters will fail.
+- Text: The text style is determined based on the number of pattern letters used. Fewer than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. 5 or more letters will fail.
 
-- Number: If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary. The following pattern letters have constraints on the count of letters. Only one letter 'F' can be specified. Up to two letters of 'd', 'H', 'h', 'K', 'k', 'm', and 's' can be specified. Up to three letters of 'D' can be specified.
+- Number(n): Here n denotes the maximum number of letters with which this type of datetime pattern can be used. If the count of letters is one, then the value is output using the minimum number of digits and without padding. Otherwise, the count of digits is used as the width of the output field, with the value zero-padded as necessary.
 
 - Number/Text: If the count of pattern letters is 3 or greater, use the Text rules above. Otherwise use the Number rules above.
 
@@ -76,7 +76,7 @@ The count of pattern letters determines the format.
 
 - Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present.
 
-- Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise, it is interpreted as a number. The text form is depend on letters - 'M' denotes the 'standard' form, and 'L' is for 'stand-alone' form. The difference between the 'standard' and 'stand-alone' forms is trickier to describe as there is no difference in English. However, in other languages there is a difference in the word used when the text is used alone, as opposed to in a complete date. For example, the word used for a month when used alone in a date picker is different to the word used for month in association with a day and year in a date. In Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters (more than 5 letters is invalid):
+- Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise, it is interpreted as a number. The text form depends on the letter used - 'M' denotes the 'standard' form, and 'L' is for the 'stand-alone' form. The difference between the 'standard' and 'stand-alone' forms is trickier to describe as there is no difference in English. However, in other languages there is a difference in the word used when the text is used alone, as opposed to in a complete date. For example, the word used for a month when used alone in a date picker is different from the word used for a month in association with a day and year in a date. In Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters (more than 4 letters is invalid):
   - `'M'` or `'L'`: Month number in a year starting from 1. There is no difference between 'M' and 'L'. Month from 1 to 9 are printed without padding.
     ```sql
     spark-sql> select date_format(date '1970-01-01', "M");
@@ -119,13 +119,8 @@ The count of pattern letters determines the format.
     spark-sql> select to_csv(named_struct('date', date '1970-01-01'), map('dateFormat', 'LLLL', 'locale', 'RU'));
     январь
     ```
-  - `'LLLLL'` or `'MMMMM'`: Narrow textual representation of standard or stand-alone forms. Typically it is a single letter.
-    ```sql
-    spark-sql> select date_format(date '1970-07-01', "LLLLL");
-    J
-    spark-sql> select date_format(date '1970-01-01', "MMMMM");
-    J
-    ```
+
+- am-pm: This outputs the am-pm-of-day. Pattern letter count must be 1.
 
 - Zone ID(V): This outputs the display the time-zone ID. Pattern letter count must be 2.
 
@@ -147,5 +142,3 @@ More details for the text style:
 
 - Short Form: Short text, typically an abbreviation. For example, day-of-week Monday might output "Mon".
 
 - Full Form: Full text, typically the full description. For example, day-of-week Monday might output "Monday".
-
-- Narrow Form: Narrow text, typically a single letter. For example, day-of-week Monday might output "M".
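Taken together, the doc changes above track a behavior change: the four-letter full forms keep working, while the five-letter narrow forms are no longer accepted. A minimal sketch of the new behavior — hypothetical demo code, not part of this patch; it assumes a local Spark build with these changes applied:

```scala
import org.apache.spark.sql.SparkSession

// 'MMMM' still formats the full month name, while the narrow 'MMMMM' form is
// rejected when the formatter is built. With this patch the failure surfaces
// as a SparkUpgradeException pointing users at the LEGACY parser policy.
object NarrowFormDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("narrow-form-demo").getOrCreate()

    spark.sql("SELECT date_format(date '1970-07-01', 'MMMM')").show()  // July

    try {
      spark.sql("SELECT date_format(date '1970-07-01', 'MMMMM')").show()
    } catch {
      case e: Exception => println(s"narrow form rejected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```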
diff --git a/docs/web-ui.md b/docs/web-ui.md
index 3c35dbeec86a2..e2e612cef3e54 100644
--- a/docs/web-ui.md
+++ b/docs/web-ui.md
@@ -407,6 +407,34 @@ Here is the list of SQL metrics:
 
+## Structured Streaming Tab
+When running Structured Streaming jobs in micro-batch mode, a Structured Streaming tab will be
+available on the Web UI. The overview page displays some brief statistics for running and completed
+queries. Also, you can check the latest exception of a failed query. For detailed statistics, please
+click a "run id" in the tables.
+
+<p style="text-align: center;">
+  <img src="img/webui-structured-streaming-detail.png" title="Structured Streaming Query Statistics" alt="Structured Streaming Query Statistics">
+</p>
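The statistics page only renders while a micro-batch query is active. A quick way to exercise it locally — a hypothetical sketch, not part of this patch (the rate-source settings and timeout are arbitrary) — before reading through the metrics below:

```scala
import org.apache.spark.sql.SparkSession

// Start a trivial rate-source query so the Structured Streaming tab (and the
// statistics page described below) shows up at http://localhost:4040.
object StreamingTabDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("streaming-tab-demo").getOrCreate()

    val query = spark.readStream
      .format("rate")                   // built-in source emitting (timestamp, value) rows
      .option("rowsPerSecond", 100)
      .load()
      .writeStream
      .format("console")
      .start()

    query.awaitTermination(60 * 1000L)  // keep the query alive long enough to browse the UI
    spark.stop()
  }
}
```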
+ +The statistics page displays some useful metrics for insight into the status of your streaming +queries. Currently, it contains the following metrics. + +* **Input Rate.** The aggregate (across all sources) rate of data arriving. +* **Process Rate.** The aggregate (across all sources) rate at which Spark is processing data. +* **Input Rows.** The aggregate (across all sources) number of records processed in a trigger. +* **Batch Duration.** The process duration of each batch. +* **Operation Duration.** The amount of time taken to perform various operations in milliseconds. +The tracked operations are listed as follows. + * addBatch: Adds result data of the current batch to the sink. + * getBatch: Gets a new batch of data to process. + * latestOffset: Gets the latest offsets for sources. + * queryPlanning: Generates the execution plan. + * walCommit: Writes the offsets to the metadata log. + +As an early-release version, the statistics page is still under development and will be improved in +future releases. + ## Streaming Tab The web UI includes a Streaming tab if the application uses Spark streaming. This tab displays scheduling delay and processing time for each micro-batch in the data stream, which can be useful diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala index 63b99a0de4b65..19790fd270619 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala @@ -19,10 +19,11 @@ package org.apache.spark.ml.evaluation import org.apache.spark.annotation.Since import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol, HasWeightCol} import org.apache.spark.ml.util._ import org.apache.spark.sql.Dataset -import org.apache.spark.sql.functions.col +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.DoubleType /** * Evaluator for clustering results. 
@@ -34,7 +35,8 @@ import org.apache.spark.sql.functions.col */ @Since("2.3.0") class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) - extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + extends Evaluator with HasPredictionCol with HasFeaturesCol with HasWeightCol + with DefaultParamsWritable { @Since("2.3.0") def this() = this(Identifiable.randomUID("cluEval")) @@ -53,6 +55,10 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str @Since("2.3.0") def setFeaturesCol(value: String): this.type = set(featuresCol, value) + /** @group setParam */ + @Since("3.1.0") + def setWeightCol(value: String): this.type = set(weightCol, value) + /** * param for metric name in evaluation * (supports `"silhouette"` (default)) @@ -116,12 +122,26 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str */ @Since("3.1.0") def getMetrics(dataset: Dataset[_]): ClusteringMetrics = { - SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol)) - SchemaUtils.checkNumericType(dataset.schema, $(predictionCol)) + val schema = dataset.schema + SchemaUtils.validateVectorCompatibleColumn(schema, $(featuresCol)) + SchemaUtils.checkNumericType(schema, $(predictionCol)) + if (isDefined(weightCol)) { + SchemaUtils.checkNumericType(schema, $(weightCol)) + } + + val weightColName = if (!isDefined(weightCol)) "weightCol" else $(weightCol) val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol)) - val df = dataset.select(col($(predictionCol)), - vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata)) + val df = if (!isDefined(weightCol) || $(weightCol).isEmpty) { + dataset.select(col($(predictionCol)), + vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata), + lit(1.0).as(weightColName)) + } else { + dataset.select(col($(predictionCol)), + vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata), + col(weightColName).cast(DoubleType)) + } + val metrics = new ClusteringMetrics(df) metrics.setDistanceMeasure($(distanceMeasure)) metrics diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala index 30970337d7d3b..8bf4ee1ecadfb 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala @@ -47,9 +47,9 @@ class ClusteringMetrics private[spark](dataset: Dataset[_]) { val columns = dataset.columns.toSeq if (distanceMeasure.equalsIgnoreCase("squaredEuclidean")) { SquaredEuclideanSilhouette.computeSilhouetteScore( - dataset, columns(0), columns(1)) + dataset, columns(0), columns(1), columns(2)) } else { - CosineSilhouette.computeSilhouetteScore(dataset, columns(0), columns(1)) + CosineSilhouette.computeSilhouetteScore(dataset, columns(0), columns(1), columns(2)) } } } @@ -63,9 +63,10 @@ private[evaluation] abstract class Silhouette { def pointSilhouetteCoefficient( clusterIds: Set[Double], pointClusterId: Double, - pointClusterNumOfPoints: Long, + weightSum: Double, + weight: Double, averageDistanceToCluster: (Double) => Double): Double = { - if (pointClusterNumOfPoints == 1) { + if (weightSum == weight) { // Single-element clusters have silhouette 0 0.0 } else { @@ -77,8 +78,8 @@ private[evaluation] abstract class Silhouette { val neighboringClusterDissimilarity = otherClusterIds.map(averageDistanceToCluster).min // adjustment 
for excluding the node itself from the computation of the average dissimilarity
       val currentClusterDissimilarity =
-        averageDistanceToCluster(pointClusterId) * pointClusterNumOfPoints /
-          (pointClusterNumOfPoints - 1)
+        averageDistanceToCluster(pointClusterId) * weightSum /
+          (weightSum - weight)
       if (currentClusterDissimilarity < neighboringClusterDissimilarity) {
         1 - (currentClusterDissimilarity / neighboringClusterDissimilarity)
       } else if (currentClusterDissimilarity > neighboringClusterDissimilarity) {
@@ -92,8 +93,8 @@
   /**
    * Compute the mean Silhouette values of all samples.
    */
-  def overallScore(df: DataFrame, scoreColumn: Column): Double = {
-    df.select(avg(scoreColumn)).collect()(0).getDouble(0)
+  def overallScore(df: DataFrame, scoreColumn: Column, weightColumn: Column): Double = {
+    df.select(sum(scoreColumn * weightColumn) / sum(weightColumn)).collect()(0).getDouble(0)
   }
 }
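The weighted replacements above are easy to sanity-check outside Spark. `overallScore` becomes the weighted mean `sum(score * weight) / sum(weight)` instead of a plain average, and the own-cluster dissimilarity is rescaled by `weightSum / (weightSum - weight)`: the average over the whole cluster includes the point's zero distance to itself, so dropping its weight from the denominator leaves the mean over the other members only. A small self-contained sketch (plain Scala, all names invented for illustration):

```scala
// Check that rescaling the whole-cluster weighted mean distance by
// W / (W - w_i) equals the weighted mean over the *other* cluster members.
object WeightedSelfExclusionCheck {
  def main(args: Array[String]): Unit = {
    // distances from point i to every member of its own cluster (itself first)
    val distances = Seq(0.0, 2.0, 4.0)
    val weights = Seq(1.5, 1.0, 0.5) // w_i = 1.5, total W = 3.0
    val w = weights.sum

    // mean over the whole cluster, as derived from the precomputed sums
    val includingSelf = distances.zip(weights).map { case (d, wj) => d * wj }.sum / w
    // the adjustment applied in pointSilhouetteCoefficient above
    val excludingSelf = includingSelf * w / (w - weights.head)

    // direct weighted mean over the other points: (2.0 * 1.0 + 4.0 * 0.5) / 1.5
    val direct = distances.tail.zip(weights.tail).map { case (d, wj) => d * wj }.sum /
      weights.tail.sum

    println(s"adjusted=$excludingSelf direct=$direct") // both print 2.666...
  }
}
```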
@@ -267,7 +268,7 @@
     }
   }
 
-  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, numOfPoints: Long)
+  case class ClusterStats(featureSum: Vector, squaredNormSum: Double, weightSum: Double)
 
   /**
    * The method takes the input dataset and computes the aggregated values
@@ -277,6 +278,7 @@
    * @param predictionCol The name of the column which contains the predicted cluster id
    *                      for the point.
    * @param featuresCol The name of the column which contains the feature vector of the point.
+   * @param weightCol The name of the column which contains the instance weight.
    * @return A [[scala.collection.immutable.Map]] which associates each cluster id
    *         to a [[ClusterStats]] object (which contains the precomputed values `N`,
    *         `$\Psi_{\Gamma}$` and `$Y_{\Gamma}$` for a cluster).
@@ -284,36 +286,39 @@
   def computeClusterStats(
       df: DataFrame,
       predictionCol: String,
-      featuresCol: String): Map[Double, ClusterStats] = {
+      featuresCol: String,
+      weightCol: String): Map[Double, ClusterStats] = {
     val numFeatures = MetadataUtils.getNumFeatures(df, featuresCol)
     val clustersStatsRDD = df.select(
-        col(predictionCol).cast(DoubleType), col(featuresCol), col("squaredNorm"))
+        col(predictionCol).cast(DoubleType), col(featuresCol), col("squaredNorm"), col(weightCol))
       .rdd
-      .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) }
-      .aggregateByKey[(DenseVector, Double, Long)]((Vectors.zeros(numFeatures).toDense, 0.0, 0L))(
-        seqOp = {
-          case (
-              (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long),
-              (features, squaredNorm)
-            ) =>
-            BLAS.axpy(1.0, features, featureSum)
-            (featureSum, squaredNormSum + squaredNorm, numOfPoints + 1)
-        },
-        combOp = {
-          case (
-              (featureSum1, squaredNormSum1, numOfPoints1),
-              (featureSum2, squaredNormSum2, numOfPoints2)
-            ) =>
-            BLAS.axpy(1.0, featureSum2, featureSum1)
-            (featureSum1, squaredNormSum1 + squaredNormSum2, numOfPoints1 + numOfPoints2)
-        }
-      )
+      .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2), row.getDouble(3))) }
+      .aggregateByKey
+        [(DenseVector, Double, Double)]((Vectors.zeros(numFeatures).toDense, 0.0, 0.0))(
+        seqOp = {
+          case (
+              (featureSum: DenseVector, squaredNormSum: Double, weightSum: Double),
+              (features, squaredNorm, weight)
+            ) =>
+            require(weight >= 0.0, s"illegal weight value: $weight. weight must be >= 0.0.")
+            BLAS.axpy(weight, features, featureSum)
+            (featureSum, squaredNormSum + squaredNorm * weight, weightSum + weight)
+        },
+        combOp = {
+          case (
+              (featureSum1, squaredNormSum1, weightSum1),
+              (featureSum2, squaredNormSum2, weightSum2)
+            ) =>
+            BLAS.axpy(1.0, featureSum2, featureSum1)
+            (featureSum1, squaredNormSum1 + squaredNormSum2, weightSum1 + weightSum2)
+        }
+      )
 
     clustersStatsRDD
       .collectAsMap()
       .mapValues {
-        case (featureSum: DenseVector, squaredNormSum: Double, numOfPoints: Long) =>
-          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, numOfPoints)
+        case (featureSum: DenseVector, squaredNormSum: Double, weightSum: Double) =>
+          SquaredEuclideanSilhouette.ClusterStats(featureSum, squaredNormSum, weightSum)
       }
       .toMap
   }
@@ -324,6 +329,7 @@ private[evaluation] object SquaredEuclideanSilhouette extends Silhouette {
    * @param broadcastedClustersMap A map of the precomputed values for each cluster.
    * @param point The [[org.apache.spark.ml.linalg.Vector]] representing the current point.
    * @param clusterId The id of the cluster the current point belongs to.
+   * @param weight The instance weight of the current point.
    * @param squaredNorm The `$\Xi_{X}$` (which is the squared norm) precomputed for the point.
    * @return The Silhouette for the point.
    */
@@ -331,6 +337,7 @@
       broadcastedClustersMap: Broadcast[Map[Double, ClusterStats]],
       point: Vector,
       clusterId: Double,
+      weight: Double,
       squaredNorm: Double): Double = {
 
     def compute(targetClusterId: Double): Double = {
@@ -338,13 +345,14 @@
       val pointDotClusterFeaturesSum = BLAS.dot(point, clusterStats.featureSum)
 
       squaredNorm +
-        clusterStats.squaredNormSum / clusterStats.numOfPoints -
-        2 * pointDotClusterFeaturesSum / clusterStats.numOfPoints
+        clusterStats.squaredNormSum / clusterStats.weightSum -
+        2 * pointDotClusterFeaturesSum / clusterStats.weightSum
     }
 
     pointSilhouetteCoefficient(broadcastedClustersMap.value.keySet,
       clusterId,
-      broadcastedClustersMap.value(clusterId).numOfPoints,
+      broadcastedClustersMap.value(clusterId).weightSum,
+      weight,
       compute)
   }
 
@@ -355,12 +363,14 @@
    * @param predictionCol The name of the column which contains the predicted cluster id
    *                      for the point.
    * @param featuresCol The name of the column which contains the feature vector of the point.
+   * @param weightCol The name of the column which contains instance weight.
    * @return The average of the Silhouette values of the clustered data.
*/ def computeSilhouetteScore( dataset: Dataset[_], predictionCol: String, - featuresCol: String): Double = { + featuresCol: String, + weightCol: String): Double = { SquaredEuclideanSilhouette.registerKryoClasses(dataset.sparkSession.sparkContext) val squaredNormUDF = udf { @@ -370,7 +380,7 @@ private[evaluation] object SquaredEuclideanSilhouette extends Silhouette { // compute aggregate values for clusters needed by the algorithm val clustersStatsMap = SquaredEuclideanSilhouette - .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol) + .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol, weightCol) // Silhouette is reasonable only when the number of clusters is greater then 1 assert(clustersStatsMap.size > 1, "Number of clusters must be greater than one.") @@ -378,12 +388,12 @@ private[evaluation] object SquaredEuclideanSilhouette extends Silhouette { val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap) val computeSilhouetteCoefficientUDF = udf { - computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double, _: Double) + computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double, _: Double, _: Double) } val silhouetteScore = overallScore(dfWithSquaredNorm, computeSilhouetteCoefficientUDF(col(featuresCol), col(predictionCol).cast(DoubleType), - col("squaredNorm"))) + col(weightCol), col("squaredNorm")), col(weightCol)) bClustersStatsMap.destroy() @@ -472,30 +482,35 @@ private[evaluation] object CosineSilhouette extends Silhouette { * about a cluster which are needed by the algorithm. * * @param df The DataFrame which contains the input data + * @param featuresCol The name of the column which contains the feature vector of the point. * @param predictionCol The name of the column which contains the predicted cluster id * for the point. + * @param weightCol The name of the column which contains the instance weight. * @return A [[scala.collection.immutable.Map]] which associates each cluster id to a * its statistics (ie. the precomputed values `N` and `$\Omega_{\Gamma}$`). */ def computeClusterStats( df: DataFrame, featuresCol: String, - predictionCol: String): Map[Double, (Vector, Long)] = { + predictionCol: String, + weightCol: String): Map[Double, (Vector, Double)] = { val numFeatures = MetadataUtils.getNumFeatures(df, featuresCol) val clustersStatsRDD = df.select( - col(predictionCol).cast(DoubleType), col(normalizedFeaturesColName)) + col(predictionCol).cast(DoubleType), col(normalizedFeaturesColName), col(weightCol)) .rdd - .map { row => (row.getDouble(0), row.getAs[Vector](1)) } - .aggregateByKey[(DenseVector, Long)]((Vectors.zeros(numFeatures).toDense, 0L))( + .map { row => (row.getDouble(0), (row.getAs[Vector](1), row.getDouble(2))) } + .aggregateByKey[(DenseVector, Double)]((Vectors.zeros(numFeatures).toDense, 0.0))( seqOp = { - case ((normalizedFeaturesSum: DenseVector, numOfPoints: Long), (normalizedFeatures)) => - BLAS.axpy(1.0, normalizedFeatures, normalizedFeaturesSum) - (normalizedFeaturesSum, numOfPoints + 1) + case ((normalizedFeaturesSum: DenseVector, weightSum: Double), + (normalizedFeatures, weight)) => + require(weight >= 0.0, s"illegal weight value: $weight. 
weight must be >= 0.0.") + BLAS.axpy(weight, normalizedFeatures, normalizedFeaturesSum) + (normalizedFeaturesSum, weightSum + weight) }, combOp = { - case ((normalizedFeaturesSum1, numOfPoints1), (normalizedFeaturesSum2, numOfPoints2)) => + case ((normalizedFeaturesSum1, weightSum1), (normalizedFeaturesSum2, weightSum2)) => BLAS.axpy(1.0, normalizedFeaturesSum2, normalizedFeaturesSum1) - (normalizedFeaturesSum1, numOfPoints1 + numOfPoints2) + (normalizedFeaturesSum1, weightSum1 + weightSum2) } ) @@ -511,11 +526,13 @@ private[evaluation] object CosineSilhouette extends Silhouette { * @param normalizedFeatures The [[org.apache.spark.ml.linalg.Vector]] representing the * normalized features of the current point. * @param clusterId The id of the cluster the current point belongs to. + * @param weight The instance weight of the current point. */ def computeSilhouetteCoefficient( - broadcastedClustersMap: Broadcast[Map[Double, (Vector, Long)]], + broadcastedClustersMap: Broadcast[Map[Double, (Vector, Double)]], normalizedFeatures: Vector, - clusterId: Double): Double = { + clusterId: Double, + weight: Double): Double = { def compute(targetClusterId: Double): Double = { val (normalizedFeatureSum, numOfPoints) = broadcastedClustersMap.value(targetClusterId) @@ -525,6 +542,7 @@ private[evaluation] object CosineSilhouette extends Silhouette { pointSilhouetteCoefficient(broadcastedClustersMap.value.keySet, clusterId, broadcastedClustersMap.value(clusterId)._2, + weight, compute) } @@ -535,12 +553,14 @@ private[evaluation] object CosineSilhouette extends Silhouette { * @param predictionCol The name of the column which contains the predicted cluster id * for the point. * @param featuresCol The name of the column which contains the feature vector of the point. + * @param weightCol The name of the column which contains the instance weight. * @return The average of the Silhouette values of the clustered data. 
*/ def computeSilhouetteScore( dataset: Dataset[_], predictionCol: String, - featuresCol: String): Double = { + featuresCol: String, + weightCol: String): Double = { val normalizeFeatureUDF = udf { features: Vector => { val norm = Vectors.norm(features, 2.0) @@ -553,7 +573,7 @@ private[evaluation] object CosineSilhouette extends Silhouette { // compute aggregate values for clusters needed by the algorithm val clustersStatsMap = computeClusterStats(dfWithNormalizedFeatures, featuresCol, - predictionCol) + predictionCol, weightCol) // Silhouette is reasonable only when the number of clusters is greater then 1 assert(clustersStatsMap.size > 1, "Number of clusters must be greater than one.") @@ -561,12 +581,12 @@ private[evaluation] object CosineSilhouette extends Silhouette { val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap) val computeSilhouetteCoefficientUDF = udf { - computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double) + computeSilhouetteCoefficient(bClustersStatsMap, _: Vector, _: Double, _: Double) } val silhouetteScore = overallScore(dfWithNormalizedFeatures, computeSilhouetteCoefficientUDF(col(normalizedFeaturesColName), - col(predictionCol).cast(DoubleType))) + col(predictionCol).cast(DoubleType), col(weightCol)), col(weightCol)) bClustersStatsMap.destroy() diff --git a/mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala index 29fed5322c9c9..d4c620adc2e3c 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala @@ -19,12 +19,13 @@ package org.apache.spark.ml.evaluation import org.apache.spark.{SparkException, SparkFunSuite} import org.apache.spark.ml.attribute.AttributeGroup -import org.apache.spark.ml.linalg.Vector +import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamsSuite import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils} import org.apache.spark.ml.util.TestingUtils._ import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.lit class ClusteringEvaluatorSuite @@ -161,4 +162,44 @@ class ClusteringEvaluatorSuite assert(evaluator.evaluate(irisDataset) == silhouetteScoreCosin) } + + test("test weight support") { + Seq("squaredEuclidean", "cosine").foreach { distanceMeasure => + val evaluator1 = new ClusteringEvaluator() + .setFeaturesCol("features") + .setPredictionCol("label") + .setDistanceMeasure(distanceMeasure) + + val evaluator2 = new ClusteringEvaluator() + .setFeaturesCol("features") + .setPredictionCol("label") + .setDistanceMeasure(distanceMeasure) + .setWeightCol("weight") + + Seq(0.25, 1.0, 10.0, 99.99).foreach { w => + var score1 = evaluator1.evaluate(irisDataset) + var score2 = evaluator2.evaluate(irisDataset.withColumn("weight", lit(w))) + assert(score1 ~== score2 relTol 1e-6) + + score1 = evaluator1.evaluate(newIrisDataset) + score2 = evaluator2.evaluate(newIrisDataset.withColumn("weight", lit(w))) + assert(score1 ~== score2 relTol 1e-6) + } + } + } + + test("single-element clusters with weight") { + val singleItemClusters = spark.createDataFrame(spark.sparkContext.parallelize(Array( + (0.0, Vectors.dense(5.1, 3.5, 1.4, 0.2), 6.0), + (1.0, Vectors.dense(7.0, 3.2, 4.7, 1.4), 0.25), + (2.0, Vectors.dense(6.3, 3.3, 6.0, 2.5), 9.99)))).toDF("label", 
"features", "weight") + Seq("squaredEuclidean", "cosine").foreach { distanceMeasure => + val evaluator = new ClusteringEvaluator() + .setFeaturesCol("features") + .setPredictionCol("label") + .setDistanceMeasure(distanceMeasure) + .setWeightCol("weight") + assert(evaluator.evaluate(singleItemClusters) === 0.0) + } + } } diff --git a/pom.xml b/pom.xml index e98bcc033f5f1..1b225bde6774d 100644 --- a/pom.xml +++ b/pom.xml @@ -1363,6 +1363,10 @@ com.zaxxer HikariCP-java7 + + com.microsoft.sqlserver + mssql-jdbc + diff --git a/python/pyspark/context.py b/python/pyspark/context.py index b80149afa2af4..4f29f2f0be1e8 100644 --- a/python/pyspark/context.py +++ b/python/pyspark/context.py @@ -25,7 +25,6 @@ from tempfile import NamedTemporaryFile from py4j.protocol import Py4JError -from py4j.java_gateway import is_instance_of from pyspark import accumulators from pyspark.accumulators import Accumulator @@ -865,17 +864,10 @@ def union(self, rdds): first_jrdd_deserializer = rdds[0]._jrdd_deserializer if any(x._jrdd_deserializer != first_jrdd_deserializer for x in rdds): rdds = [x._reserialize() for x in rdds] - gw = SparkContext._gateway cls = SparkContext._jvm.org.apache.spark.api.java.JavaRDD - is_jrdd = is_instance_of(gw, rdds[0]._jrdd, cls) - jrdds = gw.new_array(cls, len(rdds)) + jrdds = SparkContext._gateway.new_array(cls, len(rdds)) for i in range(0, len(rdds)): - if is_jrdd: - jrdds[i] = rdds[i]._jrdd - else: - # zip could return JavaPairRDD hence we ensure `_jrdd` - # to be `JavaRDD` by wrapping it in a `map` - jrdds[i] = rdds[i].map(lambda x: x)._jrdd + jrdds[i] = rdds[i]._jrdd return RDD(self._jsc.union(jrdds), self, rdds[0]._jrdd_deserializer) def broadcast(self, value): diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 265f02c1a03ac..a69a57f588571 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -654,7 +654,7 @@ def setParams(self, predictionCol="prediction", labelCol="label", @inherit_doc -class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, +class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, HasWeightCol, JavaMLReadable, JavaMLWritable): """ Evaluator for Clustering results, which expects two input @@ -677,6 +677,18 @@ class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, ClusteringEvaluator... >>> evaluator.evaluate(dataset) 0.9079... + >>> featureAndPredictionsWithWeight = map(lambda x: (Vectors.dense(x[0]), x[1], x[2]), + ... [([0.0, 0.5], 0.0, 2.5), ([0.5, 0.0], 0.0, 2.5), ([10.0, 11.0], 1.0, 2.5), + ... ([10.5, 11.5], 1.0, 2.5), ([1.0, 1.0], 0.0, 2.5), ([8.0, 6.0], 1.0, 2.5)]) + >>> dataset = spark.createDataFrame( + ... featureAndPredictionsWithWeight, ["features", "prediction", "weight"]) + >>> evaluator = ClusteringEvaluator() + >>> evaluator.setPredictionCol("prediction") + ClusteringEvaluator... + >>> evaluator.setWeightCol("weight") + ClusteringEvaluator... + >>> evaluator.evaluate(dataset) + 0.9079... 
>>> ce_path = temp_path + "/ce" >>> evaluator.save(ce_path) >>> evaluator2 = ClusteringEvaluator.load(ce_path) @@ -694,10 +706,10 @@ class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol, @keyword_only def __init__(self, predictionCol="prediction", featuresCol="features", - metricName="silhouette", distanceMeasure="squaredEuclidean"): + metricName="silhouette", distanceMeasure="squaredEuclidean", weightCol=None): """ __init__(self, predictionCol="prediction", featuresCol="features", \ - metricName="silhouette", distanceMeasure="squaredEuclidean") + metricName="silhouette", distanceMeasure="squaredEuclidean", weightCol=None) """ super(ClusteringEvaluator, self).__init__() self._java_obj = self._new_java_obj( @@ -709,10 +721,10 @@ def __init__(self, predictionCol="prediction", featuresCol="features", @keyword_only @since("2.3.0") def setParams(self, predictionCol="prediction", featuresCol="features", - metricName="silhouette", distanceMeasure="squaredEuclidean"): + metricName="silhouette", distanceMeasure="squaredEuclidean", weightCol=None): """ setParams(self, predictionCol="prediction", featuresCol="features", \ - metricName="silhouette", distanceMeasure="squaredEuclidean") + metricName="silhouette", distanceMeasure="squaredEuclidean", weightCol=None) Sets params for clustering evaluator. """ kwargs = self._input_kwargs @@ -758,6 +770,13 @@ def setPredictionCol(self, value): """ return self._set(predictionCol=value) + @since("3.1.0") + def setWeightCol(self, value): + """ + Sets the value of :py:attr:`weightCol`. + """ + return self._set(weightCol=value) + @inherit_doc class RankingEvaluator(JavaEvaluator, HasLabelCol, HasPredictionCol, diff --git a/python/pyspark/tests/test_rdd.py b/python/pyspark/tests/test_rdd.py index 04dfe68e57a3a..62ad4221d7078 100644 --- a/python/pyspark/tests/test_rdd.py +++ b/python/pyspark/tests/test_rdd.py @@ -168,15 +168,6 @@ def test_zip_chaining(self): set([(x, (x, x)) for x in 'abc']) ) - def test_union_pair_rdd(self): - # Regression test for SPARK-31788 - rdd = self.sc.parallelize([1, 2]) - pair_rdd = rdd.zip(rdd) - self.assertEqual( - self.sc.union([pair_rdd, pair_rdd]).collect(), - [((1, 1), (2, 2)), ((1, 1), (2, 2))] - ) - def test_deleting_input_files(self): # Regression test for SPARK-1025 tempFile = tempfile.NamedTemporaryFile(delete=False) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala index 5e53927885ca4..e2559d4c07297 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala @@ -339,7 +339,7 @@ object FunctionRegistry { expression[GetJsonObject]("get_json_object"), expression[InitCap]("initcap"), expression[StringInstr]("instr"), - expression[Lower]("lcase"), + expression[Lower]("lcase", true), expression[Length]("length"), expression[Levenshtein]("levenshtein"), expression[Like]("like"), @@ -350,7 +350,7 @@ object FunctionRegistry { expression[StringTrimLeft]("ltrim"), expression[JsonTuple]("json_tuple"), expression[ParseUrl]("parse_url"), - expression[StringLocate]("position"), + expression[StringLocate]("position", true), expression[FormatString]("printf", true), expression[RegExpExtract]("regexp_extract"), expression[RegExpReplace]("regexp_replace"), @@ -491,6 +491,7 @@ object FunctionRegistry { 
expression[InputFileBlockLength]("input_file_block_length"), expression[MonotonicallyIncreasingID]("monotonically_increasing_id"), expression[CurrentDatabase]("current_database"), + expression[CurrentCatalog]("current_catalog"), expression[CallMethodViaReflection]("reflect"), expression[CallMethodViaReflection]("java_method", true), expression[SparkVersion]("version"), diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala index 8e87a82769471..f2bb7db895ca2 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala @@ -85,13 +85,13 @@ class UnivocityParser( // We preallocate it avoid unnecessary allocations. private val noRows = None - private val timestampFormatter = TimestampFormatter( + private lazy val timestampFormatter = TimestampFormatter( options.timestampFormat, options.zoneId, options.locale, legacyFormat = FAST_DATE_FORMAT, needVarLengthSecondFraction = true) - private val dateFormatter = DateFormatter( + private lazy val dateFormatter = DateFormatter( options.dateFormat, options.zoneId, options.locale, diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala index fe3fea5e35b1b..26f5bee72092c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala @@ -18,7 +18,7 @@ package org.apache.spark.sql.catalyst import java.sql.{Date, Timestamp} -import java.time.LocalDate +import java.time.{Instant, LocalDate} import scala.language.implicitConversions @@ -152,6 +152,7 @@ package object dsl { implicit def bigDecimalToLiteral(d: java.math.BigDecimal): Literal = Literal(d) implicit def decimalToLiteral(d: Decimal): Literal = Literal(d) implicit def timestampToLiteral(t: Timestamp): Literal = Literal(t) + implicit def instantToLiteral(i: Instant): Literal = Literal(i) implicit def binaryToLiteral(a: Array[Byte]): Literal = Literal(a) implicit def symbolToUnresolvedAttribute(s: Symbol): analysis.UnresolvedAttribute = diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala index 858c91a4d8e86..5212ef3930bc9 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala @@ -19,7 +19,7 @@ package org.apache.spark.sql.catalyst.expressions import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.analysis.{TypeCheckResult, TypeCoercion} -import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder +import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.{FUNC_ALIAS, FunctionBuilder} import org.apache.spark.sql.catalyst.expressions.codegen._ import org.apache.spark.sql.catalyst.expressions.codegen.Block._ import org.apache.spark.sql.catalyst.util._ @@ -311,7 +311,12 @@ case object NamePlaceholder extends LeafExpression with Unevaluable { /** * Returns a Row containing the evaluation of all children expressions. 
*/ -object CreateStruct extends FunctionBuilder { +object CreateStruct { + /** + * Returns a named struct with generated names or using the names when available. + * It should not be used for `struct` expressions or functions explicitly called + * by users. + */ def apply(children: Seq[Expression]): CreateNamedStruct = { CreateNamedStruct(children.zipWithIndex.flatMap { case (e: NamedExpression, _) if e.resolved => Seq(Literal(e.name), e) @@ -320,12 +325,23 @@ object CreateStruct extends FunctionBuilder { }) } + /** + * Returns a named struct with a pretty SQL name. It will show the pretty SQL string + * in its output column name as if `struct(...)` was called. Should be + * used for `struct` expressions or functions explicitly called by users. + */ + def create(children: Seq[Expression]): CreateNamedStruct = { + val expr = CreateStruct(children) + expr.setTagValue(FUNC_ALIAS, "struct") + expr + } + /** * Entry to use in the function registry. */ val registryEntry: (String, (ExpressionInfo, FunctionBuilder)) = { val info: ExpressionInfo = new ExpressionInfo( - "org.apache.spark.sql.catalyst.expressions.NamedStruct", + classOf[CreateNamedStruct].getCanonicalName, null, "struct", "_FUNC_(col1, col2, col3, ...) - Creates a struct with the given field values.", @@ -335,7 +351,7 @@ object CreateStruct extends FunctionBuilder { "", "", "") - ("struct", (info, this)) + ("struct", (info, this.create)) } } @@ -433,7 +449,15 @@ case class CreateNamedStruct(children: Seq[Expression]) extends Expression { """.stripMargin, isNull = FalseLiteral) } - override def prettyName: String = "named_struct" + // There is an alias set at `CreateStruct.create`. If there is an alias, + // this is the struct function explicitly called by a user and we should + // respect it in the SQL string as `struct(...)`. 
+ override def prettyName: String = getTagValue(FUNC_ALIAS).getOrElse("named_struct") + + override def sql: String = getTagValue(FUNC_ALIAS).map { alias => + val childrenSQL = children.indices.filter(_ % 2 == 1).map(children(_).sql).mkString(", ") + s"$alias($childrenSQL)" + }.getOrElse(super.sql) } /** diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala index afc57aa546fe8..7dc008a2e5df8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala @@ -880,6 +880,7 @@ abstract class ToTimestamp legacyFormat = SIMPLE_DATE_FORMAT, needVarLengthSecondFraction = true) } catch { + case e: SparkUpgradeException => throw e case NonFatal(_) => null } @@ -1061,6 +1062,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[ legacyFormat = SIMPLE_DATE_FORMAT, needVarLengthSecondFraction = false) } catch { + case e: SparkUpgradeException => throw e case NonFatal(_) => null } @@ -1076,6 +1078,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[ try { UTF8String.fromString(formatter.format(time.asInstanceOf[Long] * MICROS_PER_SECOND)) } catch { + case e: SparkUpgradeException => throw e case NonFatal(_) => null } } @@ -1093,6 +1096,7 @@ case class FromUnixTime(sec: Expression, format: Expression, timeZoneId: Option[ needVarLengthSecondFraction = false) .format(time.asInstanceOf[Long] * MICROS_PER_SECOND)) } catch { + case e: SparkUpgradeException => throw e case NonFatal(_) => null } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala index 66e6334e3a450..8c6fbc0fc8e44 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala @@ -491,7 +491,9 @@ case class Factorial(child: Expression) extends UnaryExpression with ImplicitCas > SELECT _FUNC_(1); 0.0 """) -case class Log(child: Expression) extends UnaryLogExpression(StrictMath.log, "LOG") +case class Log(child: Expression) extends UnaryLogExpression(StrictMath.log, "LOG") { + override def prettyName: String = getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("ln") +} @ExpressionDescription( usage = "_FUNC_(expr) - Returns the logarithm of `expr` with base 2.", @@ -546,6 +548,7 @@ case class Log1p(child: Expression) extends UnaryLogExpression(StrictMath.log1p, // scalastyle:on line.size.limit case class Rint(child: Expression) extends UnaryMathExpression(math.rint, "ROUND") { override def funcName: String = "rint" + override def prettyName: String = getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("rint") } @ExpressionDescription( diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala index 8ce3ddd30a69e..617ddcb69eab0 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala @@ -116,6 +116,24 @@ case class CurrentDatabase() extends LeafExpression with Unevaluable 
{ override def prettyName: String = "current_database" } +/** + * Returns the current catalog. + */ +@ExpressionDescription( + usage = "_FUNC_() - Returns the current catalog.", + examples = """ + Examples: + > SELECT _FUNC_(); + spark_catalog + """, + since = "3.1.0") +case class CurrentCatalog() extends LeafExpression with Unevaluable { + override def dataType: DataType = StringType + override def foldable: Boolean = true + override def nullable: Boolean = false + override def prettyName: String = "current_catalog" +} + // scalastyle:off line.size.limit @ExpressionDescription( usage = """_FUNC_() - Returns an universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string.""", diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala index 0b9fb8f85fe3c..876588e096d4a 100755 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala @@ -365,6 +365,9 @@ case class Lower(child: Expression) extends UnaryExpression with String2StringEx override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { defineCodeGen(ctx, ev, c => s"($c).toLowerCase()") } + + override def prettyName: String = + getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("lower") } /** A base trait for functions that compare two strings, returning a boolean. */ @@ -1182,7 +1185,8 @@ case class StringLocate(substr: Expression, str: Expression, start: Expression) """) } - override def prettyName: String = "locate" + override def prettyName: String = + getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("locate") } /** diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala index ef987931e928a..c4f6121723491 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala @@ -56,13 +56,13 @@ class JacksonParser( private val factory = options.buildJsonFactory() - private val timestampFormatter = TimestampFormatter( + private lazy val timestampFormatter = TimestampFormatter( options.timestampFormat, options.zoneId, options.locale, legacyFormat = FAST_DATE_FORMAT, needVarLengthSecondFraction = true) - private val dateFormatter = DateFormatter( + private lazy val dateFormatter = DateFormatter( options.dateFormat, options.zoneId, options.locale, diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index e59e3b999aa7f..f1a307b1c2cc1 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -133,7 +133,7 @@ abstract class Optimizer(catalogManager: CatalogManager) ReplaceExpressions, RewriteNonCorrelatedExists, ComputeCurrentTime, - GetCurrentDatabase(catalogManager), + GetCurrentDatabaseAndCatalog(catalogManager), RewriteDistinctAggregates, ReplaceDeduplicateWithAggregate) :: ////////////////////////////////////////////////////////////////////////////////////////// @@ -223,7 +223,7 @@ abstract class 
Optimizer(catalogManager: CatalogManager) EliminateView.ruleName :: ReplaceExpressions.ruleName :: ComputeCurrentTime.ruleName :: - GetCurrentDatabase(catalogManager).ruleName :: + GetCurrentDatabaseAndCatalog(catalogManager).ruleName :: RewriteDistinctAggregates.ruleName :: ReplaceDeduplicateWithAggregate.ruleName :: ReplaceIntersectWithSemiJoin.ruleName :: diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala index 80d85827657fd..6c9bb6db06d86 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala @@ -91,15 +91,21 @@ object ComputeCurrentTime extends Rule[LogicalPlan] { } -/** Replaces the expression of CurrentDatabase with the current database name. */ -case class GetCurrentDatabase(catalogManager: CatalogManager) extends Rule[LogicalPlan] { +/** + * Replaces the expression of CurrentDatabase with the current database name. + * Replaces the expression of CurrentCatalog with the current catalog name. + */ +case class GetCurrentDatabaseAndCatalog(catalogManager: CatalogManager) extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = { import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._ val currentNamespace = catalogManager.currentNamespace.quoted + val currentCatalog = catalogManager.currentCatalog.name() plan transformAllExpressions { case CurrentDatabase() => Literal.create(currentNamespace, StringType) + case CurrentCatalog() => + Literal.create(currentCatalog, StringType) } } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala index c0cecf8536c39..03571a740df3e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala @@ -1534,7 +1534,7 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging * Create a [[CreateStruct]] expression. 
*/ override def visitStruct(ctx: StructContext): Expression = withOrigin(ctx) { - CreateStruct(ctx.argument.asScala.map(expression)) + CreateStruct.create(ctx.argument.asScala.map(expression)) } /** diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala index 8261f57916fa2..06e1cdc27e7d5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala @@ -19,6 +19,7 @@ package org.apache.spark.sql.catalyst.util import java.text.SimpleDateFormat import java.time.{LocalDate, ZoneId} +import java.time.format.DateTimeFormatter import java.util.{Date, Locale} import org.apache.commons.lang3.time.FastDateFormat @@ -33,6 +34,8 @@ sealed trait DateFormatter extends Serializable { def format(days: Int): String def format(date: Date): String def format(localDate: LocalDate): String + + def validatePatternString(): Unit } class Iso8601DateFormatter( @@ -70,6 +73,12 @@ class Iso8601DateFormatter( override def format(date: Date): String = { legacyFormatter.format(date) } + + override def validatePatternString(): Unit = { + try { + formatter + } catch checkLegacyFormatter(pattern, legacyFormatter.validatePatternString) + } } trait LegacyDateFormatter extends DateFormatter { @@ -93,6 +102,7 @@ class LegacyFastDateFormatter(pattern: String, locale: Locale) extends LegacyDat private lazy val fdf = FastDateFormat.getInstance(pattern, locale) override def parseToDate(s: String): Date = fdf.parse(s) override def format(d: Date): String = fdf.format(d) + override def validatePatternString(): Unit = fdf } class LegacySimpleDateFormatter(pattern: String, locale: Locale) extends LegacyDateFormatter { @@ -100,6 +110,8 @@ class LegacySimpleDateFormatter(pattern: String, locale: Locale) extends LegacyD private lazy val sdf = new SimpleDateFormat(pattern, locale) override def parseToDate(s: String): Date = sdf.parse(s) override def format(d: Date): String = sdf.format(d) + override def validatePatternString(): Unit = sdf + } object DateFormatter { @@ -118,7 +130,9 @@ object DateFormatter { if (SQLConf.get.legacyTimeParserPolicy == LEGACY) { getLegacyFormatter(pattern, zoneId, locale, legacyFormat) } else { - new Iso8601DateFormatter(pattern, zoneId, locale, legacyFormat) + val df = new Iso8601DateFormatter(pattern, zoneId, locale, legacyFormat) + df.validatePatternString() + df } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala index 35f95dbffca6e..0ea54c28cb285 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala @@ -117,6 +117,34 @@ trait DateTimeFormatterHelper { s"set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY to restore the behavior " + s"before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.", e) } + + /** + * When the new DateTimeFormatter failed to initialize because of invalid datetime pattern, it + * will throw IllegalArgumentException. 
If the pattern can be recognized by the legacy formatter
+   * it will raise a SparkUpgradeException telling users to restore the previous behavior via the
+   * LEGACY policy, or to follow our guide and correct their pattern. Otherwise, the original
+   * IllegalArgumentException will be thrown.
+   *
+   * @param pattern the date time pattern
+   * @param tryLegacyFormatter a call that forces the legacy datetime formatter to be
+   *                           initialized, so that any exception it throws can be captured
+   */
+
+  protected def checkLegacyFormatter(
+      pattern: String,
+      tryLegacyFormatter: => Unit): PartialFunction[Throwable, DateTimeFormatter] = {
+    case e: IllegalArgumentException =>
+      try {
+        tryLegacyFormatter
+      } catch {
+        case _: Throwable => throw e
+      }
+      throw new SparkUpgradeException("3.0", s"Fail to recognize '$pattern' pattern in the" +
+        s" DateTimeFormatter. 1) You can set ${SQLConf.LEGACY_TIME_PARSER_POLICY.key} to LEGACY" +
+        s" to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern" +
+        s" with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html",
+        e)
+  }
 }
 
 private object DateTimeFormatterHelper {
@@ -190,6 +218,8 @@
   }
 
   final val unsupportedLetters = Set('A', 'c', 'e', 'n', 'N', 'p')
+  final val unsupportedNarrowTextStyle =
+    Set("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "QQQQQ", "qqqqq")
 
   /**
    * In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for
@@ -211,6 +241,9 @@
     for (c <- patternPart if unsupportedLetters.contains(c)) {
       throw new IllegalArgumentException(s"Illegal pattern character: $c")
     }
+    for (style <- unsupportedNarrowTextStyle if patternPart.contains(style)) {
+      throw new IllegalArgumentException(s"Too many pattern letters: ${style.head}")
+    }
     // The meaning of 'u' was day number of week in SimpleDateFormat, it was changed to year
     // in DateTimeFormatter. Substitute 'u' to 'e' and use DateTimeFormatter to parse the
     // string.
If parsable, return the result; otherwise, fall back to 'u', and then use the diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala index 1a6e5e4400ffb..de2fd312b7db5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala @@ -54,6 +54,7 @@ sealed trait TimestampFormatter extends Serializable { def format(us: Long): String def format(ts: Timestamp): String def format(instant: Instant): String + def validatePatternString(): Unit } class Iso8601TimestampFormatter( @@ -99,6 +100,12 @@ class Iso8601TimestampFormatter( override def format(ts: Timestamp): String = { legacyFormatter.format(ts) } + + override def validatePatternString(): Unit = { + try { + formatter + } catch checkLegacyFormatter(pattern, legacyFormatter.validatePatternString) + } } /** @@ -202,6 +209,8 @@ class LegacyFastTimestampFormatter( override def format(instant: Instant): String = { format(instantToMicros(instant)) } + + override def validatePatternString(): Unit = fastDateFormat } class LegacySimpleTimestampFormatter( @@ -231,6 +240,8 @@ class LegacySimpleTimestampFormatter( override def format(instant: Instant): String = { format(instantToMicros(instant)) } + + override def validatePatternString(): Unit = sdf } object LegacyDateFormats extends Enumeration { @@ -255,8 +266,10 @@ object TimestampFormatter { if (SQLConf.get.legacyTimeParserPolicy == LEGACY) { getLegacyFormatter(pattern, zoneId, locale, legacyFormat) } else { - new Iso8601TimestampFormatter( + val tf = new Iso8601TimestampFormatter( pattern, zoneId, locale, legacyFormat, needVarLengthSecondFraction) + tf.validatePatternString() + tf } } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala index 87062f2d4ef38..02d6d847dc063 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala @@ -267,7 +267,7 @@ class DateExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { // Test escaping of format GenerateUnsafeProjection.generate( - DateFormatClass(Literal(ts), Literal("\"quote"), JST_OPT) :: Nil) + DateFormatClass(Literal(ts), Literal("\""), JST_OPT) :: Nil) // SPARK-28072 The codegen path should work checkEvaluation( diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala index 817e503584324..caf7bdde10122 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelperSuite.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.catalyst.util -import org.apache.spark.SparkFunSuite +import org.apache.spark.{SparkFunSuite, SparkUpgradeException} import org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper._ class DateTimeFormatterHelperSuite extends SparkFunSuite { @@ -40,6 +40,16 @@ class DateTimeFormatterHelperSuite extends SparkFunSuite { val e = 
intercept[IllegalArgumentException](convertIncompatiblePattern(s"yyyy-MM-dd $l G"))
       assert(e.getMessage === s"Illegal pattern character: $l")
     }
+    unsupportedNarrowTextStyle.foreach { style =>
+      val e1 = intercept[IllegalArgumentException] {
+        convertIncompatiblePattern(s"yyyy-MM-dd $style")
+      }
+      assert(e1.getMessage === s"Too many pattern letters: ${style.head}")
+      val e2 = intercept[IllegalArgumentException] {
+        convertIncompatiblePattern(s"yyyy-MM-dd $style${style.head}")
+      }
+      assert(e2.getMessage === s"Too many pattern letters: ${style.head}")
+    }
     assert(convertIncompatiblePattern("yyyy-MM-dd uuuu") === "uuuu-MM-dd eeee")
     assert(convertIncompatiblePattern("yyyy-MM-dd EEEE") === "uuuu-MM-dd EEEE")
     assert(convertIncompatiblePattern("yyyy-MM-dd'e'HH:mm:ss") === "uuuu-MM-dd'e'HH:mm:ss")
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala
index dccb3defe3728..4324d3cff63d7 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/util/TimestampFormatterSuite.scala
@@ -396,4 +396,15 @@ class TimestampFormatterSuite extends SparkFunSuite with SQLHelper with Matchers
     val micros = formatter.parse("2009 11")
     assert(micros === date(2009, 1, 1, 11))
   }
+
+  test("explicitly forbidden datetime patterns") {
+    // not supported by the legacy formatter either
+    Seq("QQQQQ", "qqqqq", "A", "c", "e", "n", "N", "p").foreach { pattern =>
+      intercept[IllegalArgumentException](TimestampFormatter(pattern, UTC).format(0))
+    }
+    // supported by the legacy formatter, so we point users at the fix via SparkUpgradeException
+    Seq("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa").foreach { pattern =>
+      intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
+    }
+  }
 }
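Taken together, the hunks above implement an eager validate-then-probe flow: constructing a formatter immediately exercises the new `DateTimeFormatter` pattern parser via `validatePatternString()`, and on `IllegalArgumentException` the legacy formatter is probed to decide between rethrowing the original error and raising a migration-hinting `SparkUpgradeException`. The following is a minimal, self-contained sketch of that flow under stated assumptions: `UpgradeException` and `PatternValidation` are illustrative stand-ins rather than Spark classes, and `SimpleDateFormat` stands in for Spark's legacy formatter wrappers.

```scala
import java.text.SimpleDateFormat
import java.time.format.DateTimeFormatter
import java.util.Locale

// Illustrative stand-in for SparkUpgradeException; not the Spark class.
class UpgradeException(msg: String, cause: Throwable)
  extends RuntimeException(msg, cause)

object PatternValidation {
  // Mirrors checkLegacyFormatter: translate an IllegalArgumentException from
  // the new formatter into an upgrade hint, but only when the legacy
  // formatter accepts the same pattern.
  def checkLegacy(
      pattern: String,
      tryLegacy: => Unit): PartialFunction[Throwable, DateTimeFormatter] = {
    case e: IllegalArgumentException =>
      try {
        tryLegacy // force the legacy formatter; throws if it also rejects the pattern
      } catch {
        case _: Throwable => throw e // both formatters reject: keep the original error
      }
      // Only the legacy formatter accepts it: surface a migration hint instead.
      throw new UpgradeException(
        s"Fail to recognize '$pattern'; set the parser policy to LEGACY or fix the pattern", e)
  }

  // Eager validation at construction time, like TimestampFormatter.apply.
  def validate(pattern: String): DateTimeFormatter =
    try {
      DateTimeFormatter.ofPattern(pattern, Locale.US)
    } catch checkLegacy(pattern, new SimpleDateFormat(pattern, Locale.US))
}

object Demo extends App {
  // "aa" is rejected by DateTimeFormatter ("Too many pattern letters: a") but
  // accepted by SimpleDateFormat, so it becomes an upgrade hint:
  try PatternValidation.validate("aa")
  catch { case e: UpgradeException => println(e.getMessage) }

  // "p" is rejected by both formatters, so the original error survives:
  try PatternValidation.validate("p")
  catch { case e: IllegalArgumentException => println(e.getMessage) }
}
```

diff --git a/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt b/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
index 147a77ff098d0..0e82b632793d2 100644
--- a/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
+++ b/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
@@ -2,66 +2,66 @@ Benchmark to measure CSV read/write performance
================================================================================================

-Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
-Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

 Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-One quoted string 24907 29374 NaN 0.0 498130.5 1.0X
+One quoted string 46568 46683 198 0.0 931358.6 1.0X

-Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
-Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
+OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

 Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 1000 columns 62811 63690 1416 0.0 62811.4 1.0X
-Select 100 columns 23839 24064 230 0.0 23839.5 2.6X
-Select one column 19936 20641 827 0.1 19936.4 3.2X
-count() 4174 4380 206 0.2 4174.4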
15.0X -Select 100 columns, one bad input field 41015 42380 1688 0.0 41015.4 1.5X -Select 100 columns, corrupt record field 46281 46338 93 0.0 46280.5 1.4X +Select 1000 columns 129836 130796 1404 0.0 129836.0 1.0X +Select 100 columns 40444 40679 261 0.0 40443.5 3.2X +Select one column 33429 33475 73 0.0 33428.6 3.9X +count() 7967 8047 73 0.1 7966.7 16.3X +Select 100 columns, one bad input field 90639 90832 266 0.0 90638.6 1.4X +Select 100 columns, corrupt record field 109023 109084 74 0.0 109023.3 1.2X -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Select 10 columns + count() 10810 10997 163 0.9 1081.0 1.0X -Select 1 column + count() 7608 7641 47 1.3 760.8 1.4X -count() 2415 2462 77 4.1 241.5 4.5X +Select 10 columns + count() 20685 20707 35 0.5 2068.5 1.0X +Select 1 column + count() 13096 13149 49 0.8 1309.6 1.6X +count() 3994 4001 7 2.5 399.4 5.2X -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Create a dataset of timestamps 874 914 37 11.4 87.4 1.0X -to_csv(timestamp) 7051 7223 250 1.4 705.1 0.1X -write timestamps to files 6712 6741 31 1.5 671.2 0.1X -Create a dataset of dates 909 945 35 11.0 90.9 1.0X -to_csv(date) 4222 4231 8 2.4 422.2 0.2X -write dates to files 3799 3813 14 2.6 379.9 0.2X +Create a dataset of timestamps 2169 2203 32 4.6 216.9 1.0X +to_csv(timestamp) 14401 14591 168 0.7 1440.1 0.2X +write timestamps to files 13209 13276 59 0.8 1320.9 0.2X +Create a dataset of dates 2231 2248 17 4.5 223.1 1.0X +to_csv(date) 10406 10473 68 1.0 1040.6 0.2X +write dates to files 7970 7976 9 1.3 797.0 0.3X -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -read timestamp text from files 1342 1364 35 7.5 134.2 1.0X -read timestamps from files 20300 20473 247 0.5 2030.0 0.1X -infer timestamps from files 40705 40744 54 0.2 4070.5 0.0X -read date text from files 1146 1151 6 8.7 114.6 1.2X -read date from files 12278 12408 117 0.8 1227.8 0.1X -infer date from files 12734 12872 220 0.8 1273.4 0.1X -timestamp strings 1467 1482 15 6.8 146.7 0.9X -parse timestamps from Dataset[String] 21708 22234 477 0.5 2170.8 0.1X -infer timestamps from Dataset[String] 42357 43253 922 0.2 4235.7 0.0X -date strings 1512 1532 18 6.6 151.2 0.9X -parse dates from Dataset[String] 13436 13470 33 0.7 1343.6 0.1X -from_csv(timestamp) 20390 20486 95 0.5 2039.0 0.1X -from_csv(date) 12592 12693 139 0.8 1259.2 
0.1X +read timestamp text from files 2387 2391 6 4.2 238.7 1.0X +read timestamps from files 53503 53593 124 0.2 5350.3 0.0X +infer timestamps from files 107988 108668 647 0.1 10798.8 0.0X +read date text from files 2121 2133 12 4.7 212.1 1.1X +read date from files 29983 30039 48 0.3 2998.3 0.1X +infer date from files 30196 30436 218 0.3 3019.6 0.1X +timestamp strings 3098 3109 10 3.2 309.8 0.8X +parse timestamps from Dataset[String] 63331 63426 84 0.2 6333.1 0.0X +infer timestamps from Dataset[String] 124003 124463 490 0.1 12400.3 0.0X +date strings 3423 3429 11 2.9 342.3 0.7X +parse dates from Dataset[String] 34235 34314 76 0.3 3423.5 0.1X +from_csv(timestamp) 60829 61600 668 0.2 6082.9 0.0X +from_csv(date) 33047 33173 139 0.3 3304.7 0.1X -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -w/o filters 12535 12606 67 0.0 125348.8 1.0X -pushdown disabled 12611 12672 91 0.0 126112.9 1.0X -w/ filters 1093 1099 11 0.1 10928.3 11.5X +w/o filters 28752 28765 16 0.0 287516.5 1.0X +pushdown disabled 28856 28880 22 0.0 288556.3 1.0X +w/ filters 1714 1731 15 0.1 17137.3 16.8X diff --git a/sql/core/benchmarks/CSVBenchmark-results.txt b/sql/core/benchmarks/CSVBenchmark-results.txt index 498ca4caa0e45..a3af46c037bf9 100644 --- a/sql/core/benchmarks/CSVBenchmark-results.txt +++ b/sql/core/benchmarks/CSVBenchmark-results.txt @@ -2,66 +2,66 @@ Benchmark to measure CSV read/write performance ================================================================================================ -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -One quoted string 24073 24109 33 0.0 481463.5 1.0X +One quoted string 45457 45731 344 0.0 909136.8 1.0X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Select 1000 columns 58415 59611 2071 0.0 58414.8 1.0X -Select 100 columns 22568 23020 594 0.0 22568.0 2.6X -Select one column 18995 19058 99 0.1 18995.0 3.1X -count() 5301 5332 30 0.2 5300.9 11.0X -Select 100 columns, one bad input field 39736 40153 361 0.0 39736.1 1.5X -Select 100 columns, corrupt record field 47195 47826 590 0.0 47195.2 1.2X +Select 1000 columns 129646 130527 1412 0.0 129646.3 1.0X +Select 100 columns 42444 42551 119 0.0 42444.0 3.1X +Select one column 35415 35428 20 0.0 35414.6 3.7X +count() 11114 11128 16 0.1 11113.6 11.7X +Select 100 columns, one bad input field 93353 93670 275 0.0 93352.6 1.4X +Select 100 columns, 
corrupt record field 113569 113952 373 0.0 113568.8 1.1X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Select 10 columns + count() 9884 9904 25 1.0 988.4 1.0X -Select 1 column + count() 6794 6835 46 1.5 679.4 1.5X -count() 2060 2065 5 4.9 206.0 4.8X +Select 10 columns + count() 18498 18589 87 0.5 1849.8 1.0X +Select 1 column + count() 11078 11095 27 0.9 1107.8 1.7X +count() 3928 3950 22 2.5 392.8 4.7X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Create a dataset of timestamps 717 732 18 14.0 71.7 1.0X -to_csv(timestamp) 6994 7100 121 1.4 699.4 0.1X -write timestamps to files 6417 6435 27 1.6 641.7 0.1X -Create a dataset of dates 827 855 24 12.1 82.7 0.9X -to_csv(date) 4408 4438 32 2.3 440.8 0.2X -write dates to files 3738 3758 28 2.7 373.8 0.2X +Create a dataset of timestamps 1933 1940 11 5.2 193.3 1.0X +to_csv(timestamp) 18078 18243 255 0.6 1807.8 0.1X +write timestamps to files 12668 12786 134 0.8 1266.8 0.2X +Create a dataset of dates 2196 2201 5 4.6 219.6 0.9X +to_csv(date) 9583 9597 21 1.0 958.3 0.2X +write dates to files 7091 7110 20 1.4 709.1 0.3X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -read timestamp text from files 1121 1176 52 8.9 112.1 1.0X -read timestamps from files 21298 21366 105 0.5 2129.8 0.1X -infer timestamps from files 41008 41051 39 0.2 4100.8 0.0X -read date text from files 962 967 5 10.4 96.2 1.2X -read date from files 11749 11772 22 0.9 1174.9 0.1X -infer date from files 12426 12459 29 0.8 1242.6 0.1X -timestamp strings 1508 1519 9 6.6 150.8 0.7X -parse timestamps from Dataset[String] 21674 21997 455 0.5 2167.4 0.1X -infer timestamps from Dataset[String] 42141 42230 105 0.2 4214.1 0.0X -date strings 1694 1701 8 5.9 169.4 0.7X -parse dates from Dataset[String] 12929 12951 25 0.8 1292.9 0.1X -from_csv(timestamp) 20603 20786 166 0.5 2060.3 0.1X -from_csv(date) 12325 12338 12 0.8 1232.5 0.1X +read timestamp text from files 2166 2177 10 4.6 216.6 1.0X +read timestamps from files 53212 53402 281 0.2 5321.2 0.0X +infer timestamps from files 109788 110372 570 0.1 10978.8 0.0X +read date text from files 1921 1929 8 5.2 192.1 1.1X +read date from files 25470 25499 25 0.4 2547.0 0.1X +infer date from files 27201 27342 134 0.4 2720.1 0.1X +timestamp strings 3638 3653 19 2.7 363.8 0.6X +parse timestamps from Dataset[String] 61894 62532 555 0.2 6189.4 0.0X +infer 
timestamps from Dataset[String] 125171 125430 236 0.1 12517.1 0.0X +date strings 3736 3749 14 2.7 373.6 0.6X +parse dates from Dataset[String] 30787 30829 43 0.3 3078.7 0.1X +from_csv(timestamp) 60842 61035 209 0.2 6084.2 0.0X +from_csv(date) 30123 30196 95 0.3 3012.3 0.1X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -w/o filters 12455 12474 22 0.0 124553.8 1.0X -pushdown disabled 12462 12486 29 0.0 124624.9 1.0X -w/ filters 1073 1092 18 0.1 10727.6 11.6X +w/o filters 28985 29042 80 0.0 289852.9 1.0X +pushdown disabled 29080 29146 58 0.0 290799.4 1.0X +w/ filters 2072 2084 17 0.0 20722.3 14.0X diff --git a/sql/core/benchmarks/DateTimeBenchmark-jdk11-results.txt b/sql/core/benchmarks/DateTimeBenchmark-jdk11-results.txt index 61ca342a0d559..f4ed8ce4afaea 100644 --- a/sql/core/benchmarks/DateTimeBenchmark-jdk11-results.txt +++ b/sql/core/benchmarks/DateTimeBenchmark-jdk11-results.txt @@ -6,18 +6,18 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz datetime +/- interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date + interval(m) 1496 1569 104 6.7 149.6 1.0X -date + interval(m, d) 1514 1526 17 6.6 151.4 1.0X -date + interval(m, d, ms) 6231 6253 30 1.6 623.1 0.2X -date - interval(m) 1481 1487 9 6.8 148.1 1.0X -date - interval(m, d) 1550 1552 2 6.5 155.0 1.0X -date - interval(m, d, ms) 6269 6272 4 1.6 626.9 0.2X -timestamp + interval(m) 3017 3056 54 3.3 301.7 0.5X -timestamp + interval(m, d) 3146 3148 3 3.2 314.6 0.5X -timestamp + interval(m, d, ms) 3446 3460 20 2.9 344.6 0.4X -timestamp - interval(m) 3045 3059 19 3.3 304.5 0.5X -timestamp - interval(m, d) 3147 3164 25 3.2 314.7 0.5X -timestamp - interval(m, d, ms) 3425 3442 25 2.9 342.5 0.4X +date + interval(m) 1660 1745 120 6.0 166.0 1.0X +date + interval(m, d) 1672 1685 19 6.0 167.2 1.0X +date + interval(m, d, ms) 6462 6481 27 1.5 646.2 0.3X +date - interval(m) 1456 1480 35 6.9 145.6 1.1X +date - interval(m, d) 1501 1509 11 6.7 150.1 1.1X +date - interval(m, d, ms) 6457 6466 12 1.5 645.7 0.3X +timestamp + interval(m) 2941 2944 4 3.4 294.1 0.6X +timestamp + interval(m, d) 3008 3012 6 3.3 300.8 0.6X +timestamp + interval(m, d, ms) 3329 3333 6 3.0 332.9 0.5X +timestamp - interval(m) 2964 2982 26 3.4 296.4 0.6X +timestamp - interval(m, d) 3030 3039 13 3.3 303.0 0.5X +timestamp - interval(m, d, ms) 3312 3313 1 3.0 331.2 0.5X ================================================================================================ @@ -28,92 +28,92 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz cast to timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -cast to timestamp wholestage off 332 336 5 30.1 33.2 1.0X -cast to timestamp wholestage on 333 344 10 30.0 33.3 1.0X +cast to timestamp wholestage off 333 334 0 30.0 33.3 
1.0X +cast to timestamp wholestage on 349 368 12 28.6 34.9 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz year of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -year of timestamp wholestage off 1246 1257 16 8.0 124.6 1.0X -year of timestamp wholestage on 1209 1218 12 8.3 120.9 1.0X +year of timestamp wholestage off 1229 1229 1 8.1 122.9 1.0X +year of timestamp wholestage on 1218 1223 5 8.2 121.8 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz quarter of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -quarter of timestamp wholestage off 1608 1616 11 6.2 160.8 1.0X -quarter of timestamp wholestage on 1540 1552 10 6.5 154.0 1.0X +quarter of timestamp wholestage off 1593 1594 2 6.3 159.3 1.0X +quarter of timestamp wholestage on 1515 1529 14 6.6 151.5 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz month of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -month of timestamp wholestage off 1242 1246 6 8.1 124.2 1.0X -month of timestamp wholestage on 1202 1212 11 8.3 120.2 1.0X +month of timestamp wholestage off 1222 1246 34 8.2 122.2 1.0X +month of timestamp wholestage on 1207 1232 31 8.3 120.7 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz weekofyear of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -weekofyear of timestamp wholestage off 1879 1885 8 5.3 187.9 1.0X -weekofyear of timestamp wholestage on 1832 1845 10 5.5 183.2 1.0X +weekofyear of timestamp wholestage off 2453 2455 2 4.1 245.3 1.0X +weekofyear of timestamp wholestage on 2357 2380 22 4.2 235.7 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz day of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -day of timestamp wholestage off 1236 1239 4 8.1 123.6 1.0X -day of timestamp wholestage on 1206 1219 17 8.3 120.6 1.0X +day of timestamp wholestage off 1216 1219 5 8.2 121.6 1.0X +day of timestamp wholestage on 1205 1221 25 8.3 120.5 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz dayofyear of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -dayofyear of timestamp wholestage off 1308 1309 1 7.6 130.8 1.0X -dayofyear of timestamp wholestage on 1239 1255 15 8.1 123.9 1.1X +dayofyear of timestamp wholestage off 1268 1274 9 7.9 126.8 1.0X 
+dayofyear of timestamp wholestage on 1253 1268 10 8.0 125.3 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz dayofmonth of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -dayofmonth of timestamp wholestage off 1259 1263 5 7.9 125.9 1.0X -dayofmonth of timestamp wholestage on 1201 1205 5 8.3 120.1 1.0X +dayofmonth of timestamp wholestage off 1223 1224 1 8.2 122.3 1.0X +dayofmonth of timestamp wholestage on 1231 1246 14 8.1 123.1 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz dayofweek of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -dayofweek of timestamp wholestage off 1406 1410 6 7.1 140.6 1.0X -dayofweek of timestamp wholestage on 1387 1402 15 7.2 138.7 1.0X +dayofweek of timestamp wholestage off 1398 1406 12 7.2 139.8 1.0X +dayofweek of timestamp wholestage on 1387 1399 15 7.2 138.7 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz weekday of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -weekday of timestamp wholestage off 1355 1367 18 7.4 135.5 1.0X -weekday of timestamp wholestage on 1311 1321 10 7.6 131.1 1.0X +weekday of timestamp wholestage off 1327 1333 9 7.5 132.7 1.0X +weekday of timestamp wholestage on 1329 1333 4 7.5 132.9 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz hour of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -hour of timestamp wholestage off 996 997 2 10.0 99.6 1.0X -hour of timestamp wholestage on 930 936 6 10.7 93.0 1.1X +hour of timestamp wholestage off 1005 1016 15 9.9 100.5 1.0X +hour of timestamp wholestage on 934 940 4 10.7 93.4 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz minute of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -minute of timestamp wholestage off 1005 1012 10 9.9 100.5 1.0X -minute of timestamp wholestage on 949 952 3 10.5 94.9 1.1X +minute of timestamp wholestage off 1003 1009 8 10.0 100.3 1.0X +minute of timestamp wholestage on 934 938 7 10.7 93.4 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz second of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -second of timestamp wholestage off 1013 1014 1 9.9 101.3 1.0X -second of timestamp wholestage on 933 934 2 10.7 93.3 1.1X +second of timestamp wholestage off 997 998 2 10.0 99.7 
1.0X +second of timestamp wholestage on 925 935 8 10.8 92.5 1.1X ================================================================================================ @@ -124,15 +124,15 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz current_date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -current_date wholestage off 291 293 2 34.3 29.1 1.0X -current_date wholestage on 280 284 3 35.7 28.0 1.0X +current_date wholestage off 297 297 0 33.7 29.7 1.0X +current_date wholestage on 280 282 2 35.7 28.0 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz current_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -current_timestamp wholestage off 311 324 18 32.1 31.1 1.0X -current_timestamp wholestage on 275 364 85 36.3 27.5 1.1X +current_timestamp wholestage off 307 337 43 32.6 30.7 1.0X +current_timestamp wholestage on 260 284 29 38.4 26.0 1.2X ================================================================================================ @@ -143,43 +143,43 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz cast to date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -cast to date wholestage off 1077 1079 3 9.3 107.7 1.0X -cast to date wholestage on 1018 1030 14 9.8 101.8 1.1X +cast to date wholestage off 1066 1073 10 9.4 106.6 1.0X +cast to date wholestage on 997 1003 6 10.0 99.7 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz last_day: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -last_day wholestage off 1257 1260 4 8.0 125.7 1.0X -last_day wholestage on 1218 1227 14 8.2 121.8 1.0X +last_day wholestage off 1238 1242 6 8.1 123.8 1.0X +last_day wholestage on 1259 1272 12 7.9 125.9 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz next_day: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -next_day wholestage off 1140 1141 1 8.8 114.0 1.0X -next_day wholestage on 1067 1076 11 9.4 106.7 1.1X +next_day wholestage off 1116 1138 32 9.0 111.6 1.0X +next_day wholestage on 1052 1063 11 9.5 105.2 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_add: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_add wholestage off 1062 1064 3 9.4 106.2 1.0X -date_add wholestage on 1046 1055 11 9.6 104.6 1.0X +date_add wholestage off 1048 1049 1 9.5 104.8 1.0X +date_add wholestage on 1035 1039 
3 9.7 103.5 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_sub: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_sub wholestage off 1082 1083 1 9.2 108.2 1.0X -date_sub wholestage on 1047 1056 12 9.6 104.7 1.0X +date_sub wholestage off 1119 1127 11 8.9 111.9 1.0X +date_sub wholestage on 1028 1039 7 9.7 102.8 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz add_months: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -add_months wholestage off 1430 1431 1 7.0 143.0 1.0X -add_months wholestage on 1441 1446 8 6.9 144.1 1.0X +add_months wholestage off 1421 1421 0 7.0 142.1 1.0X +add_months wholestage on 1423 1434 11 7.0 142.3 1.0X ================================================================================================ @@ -190,8 +190,8 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz format date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -format date wholestage off 5442 5549 150 1.8 544.2 1.0X -format date wholestage on 5529 5655 236 1.8 552.9 1.0X +format date wholestage off 5293 5296 5 1.9 529.3 1.0X +format date wholestage on 5143 5157 19 1.9 514.3 1.0X ================================================================================================ @@ -202,8 +202,8 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz from_unixtime: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -from_unixtime wholestage off 7416 7440 34 1.3 741.6 1.0X -from_unixtime wholestage on 7372 7391 17 1.4 737.2 1.0X +from_unixtime wholestage off 7136 7136 1 1.4 713.6 1.0X +from_unixtime wholestage on 7049 7068 29 1.4 704.9 1.0X ================================================================================================ @@ -214,15 +214,15 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz from_utc_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -from_utc_timestamp wholestage off 1316 1320 6 7.6 131.6 1.0X -from_utc_timestamp wholestage on 1268 1272 4 7.9 126.8 1.0X +from_utc_timestamp wholestage off 1325 1329 6 7.5 132.5 1.0X +from_utc_timestamp wholestage on 1269 1273 4 7.9 126.9 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_utc_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_utc_timestamp wholestage off 1653 1657 6 6.0 165.3 1.0X -to_utc_timestamp wholestage 
on 1594 1599 4 6.3 159.4 1.0X +to_utc_timestamp wholestage off 1684 1691 10 5.9 168.4 1.0X +to_utc_timestamp wholestage on 1641 1648 9 6.1 164.1 1.0X ================================================================================================ @@ -233,29 +233,29 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz cast interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -cast interval wholestage off 341 343 3 29.4 34.1 1.0X -cast interval wholestage on 279 282 1 35.8 27.9 1.2X +cast interval wholestage off 343 346 4 29.1 34.3 1.0X +cast interval wholestage on 281 282 1 35.6 28.1 1.2X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz datediff: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -datediff wholestage off 1862 1865 4 5.4 186.2 1.0X -datediff wholestage on 1769 1783 15 5.7 176.9 1.1X +datediff wholestage off 1831 1840 13 5.5 183.1 1.0X +datediff wholestage on 1759 1769 15 5.7 175.9 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz months_between: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -months_between wholestage off 5594 5599 7 1.8 559.4 1.0X -months_between wholestage on 5498 5508 11 1.8 549.8 1.0X +months_between wholestage off 5729 5747 25 1.7 572.9 1.0X +months_between wholestage on 5710 5720 9 1.8 571.0 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz window: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -window wholestage off 2044 2127 117 0.5 2044.3 1.0X -window wholestage on 48057 48109 54 0.0 48056.9 0.0X +window wholestage off 2183 2189 9 0.5 2182.6 1.0X +window wholestage on 46835 46944 88 0.0 46834.8 0.0X ================================================================================================ @@ -266,134 +266,134 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc YEAR: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc YEAR wholestage off 2540 2542 3 3.9 254.0 1.0X -date_trunc YEAR wholestage on 2486 2507 29 4.0 248.6 1.0X +date_trunc YEAR wholestage off 2668 2672 5 3.7 266.8 1.0X +date_trunc YEAR wholestage on 2719 2731 9 3.7 271.9 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc YYYY: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc YYYY wholestage off 2542 2543 3 3.9 254.2 1.0X -date_trunc YYYY 
wholestage on 2491 2498 9 4.0 249.1 1.0X +date_trunc YYYY wholestage off 2672 2677 8 3.7 267.2 1.0X +date_trunc YYYY wholestage on 2710 2726 12 3.7 271.0 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc YY: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc YY wholestage off 2545 2569 35 3.9 254.5 1.0X -date_trunc YY wholestage on 2487 2493 4 4.0 248.7 1.0X +date_trunc YY wholestage off 2670 2673 4 3.7 267.0 1.0X +date_trunc YY wholestage on 2711 2720 7 3.7 271.1 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MON: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MON wholestage off 2590 2590 1 3.9 259.0 1.0X -date_trunc MON wholestage on 2506 2520 12 4.0 250.6 1.0X +date_trunc MON wholestage off 2674 2674 0 3.7 267.4 1.0X +date_trunc MON wholestage on 2667 2677 10 3.7 266.7 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MONTH: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MONTH wholestage off 2595 2603 11 3.9 259.5 1.0X -date_trunc MONTH wholestage on 2505 2516 12 4.0 250.5 1.0X +date_trunc MONTH wholestage off 2675 2686 16 3.7 267.5 1.0X +date_trunc MONTH wholestage on 2667 2674 6 3.7 266.7 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MM: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MM wholestage off 2605 2612 10 3.8 260.5 1.0X -date_trunc MM wholestage on 2501 2515 11 4.0 250.1 1.0X +date_trunc MM wholestage off 2673 2674 1 3.7 267.3 1.0X +date_trunc MM wholestage on 2664 2669 4 3.8 266.4 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc DAY: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc DAY wholestage off 2225 2229 5 4.5 222.5 1.0X -date_trunc DAY wholestage on 2184 2196 9 4.6 218.4 1.0X +date_trunc DAY wholestage off 2281 2288 10 4.4 228.1 1.0X +date_trunc DAY wholestage on 2302 2312 8 4.3 230.2 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc DD: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc DD wholestage off 2232 2236 6 4.5 223.2 1.0X -date_trunc DD wholestage on 2183 2190 6 4.6 218.3 1.0X +date_trunc DD wholestage off 2281 2283 3 4.4 228.1 1.0X +date_trunc DD wholestage on 2291 2302 11 4.4 229.1 1.0X OpenJDK 
64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc HOUR: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc HOUR wholestage off 2194 2199 7 4.6 219.4 1.0X -date_trunc HOUR wholestage on 2160 2166 5 4.6 216.0 1.0X +date_trunc HOUR wholestage off 2331 2332 1 4.3 233.1 1.0X +date_trunc HOUR wholestage on 2290 2304 11 4.4 229.0 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MINUTE: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MINUTE wholestage off 390 396 9 25.7 39.0 1.0X -date_trunc MINUTE wholestage on 331 337 7 30.2 33.1 1.2X +date_trunc MINUTE wholestage off 379 385 9 26.4 37.9 1.0X +date_trunc MINUTE wholestage on 371 376 5 27.0 37.1 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc SECOND: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc SECOND wholestage off 375 381 8 26.7 37.5 1.0X -date_trunc SECOND wholestage on 332 346 14 30.1 33.2 1.1X +date_trunc SECOND wholestage off 375 376 1 26.7 37.5 1.0X +date_trunc SECOND wholestage on 370 376 8 27.0 37.0 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc WEEK: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc WEEK wholestage off 2439 2443 6 4.1 243.9 1.0X -date_trunc WEEK wholestage on 2390 2409 32 4.2 239.0 1.0X +date_trunc WEEK wholestage off 2597 2604 10 3.9 259.7 1.0X +date_trunc WEEK wholestage on 2591 2605 13 3.9 259.1 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc QUARTER: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc QUARTER wholestage off 3290 3292 4 3.0 329.0 1.0X -date_trunc QUARTER wholestage on 3214 3218 3 3.1 321.4 1.0X +date_trunc QUARTER wholestage off 3501 3511 14 2.9 350.1 1.0X +date_trunc QUARTER wholestage on 3477 3489 9 2.9 347.7 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc year: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc year wholestage off 308 310 3 32.5 30.8 1.0X -trunc year wholestage on 289 293 6 34.7 28.9 1.1X +trunc year wholestage off 332 334 3 30.1 33.2 1.0X +trunc year wholestage on 332 346 17 30.1 33.2 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc yyyy: Best Time(ms) Avg 
Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc yyyy wholestage off 309 311 3 32.4 30.9 1.0X -trunc yyyy wholestage on 289 294 7 34.6 28.9 1.1X +trunc yyyy wholestage off 331 331 0 30.2 33.1 1.0X +trunc yyyy wholestage on 336 339 4 29.8 33.6 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc yy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc yy wholestage off 311 311 0 32.2 31.1 1.0X -trunc yy wholestage on 288 294 7 34.7 28.8 1.1X +trunc yy wholestage off 330 342 17 30.3 33.0 1.0X +trunc yy wholestage on 333 337 3 30.0 33.3 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc mon: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc mon wholestage off 313 313 0 32.0 31.3 1.0X -trunc mon wholestage on 287 290 2 34.8 28.7 1.1X +trunc mon wholestage off 334 335 1 30.0 33.4 1.0X +trunc mon wholestage on 333 347 9 30.0 33.3 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc month: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc month wholestage off 310 310 0 32.3 31.0 1.0X -trunc month wholestage on 287 290 2 34.8 28.7 1.1X +trunc month wholestage off 332 333 1 30.1 33.2 1.0X +trunc month wholestage on 333 340 7 30.0 33.3 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc mm: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc mm wholestage off 311 312 1 32.1 31.1 1.0X -trunc mm wholestage on 287 296 9 34.8 28.7 1.1X +trunc mm wholestage off 328 336 11 30.5 32.8 1.0X +trunc mm wholestage on 333 343 11 30.0 33.3 1.0X ================================================================================================ @@ -404,36 +404,36 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to timestamp str: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to timestamp str wholestage off 169 170 1 5.9 168.9 1.0X -to timestamp str wholestage on 161 168 11 6.2 161.0 1.0X +to timestamp str wholestage off 170 171 1 5.9 170.1 1.0X +to timestamp str wholestage on 172 174 2 5.8 171.6 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_timestamp wholestage off 1360 1361 1 
0.7 1359.6 1.0X -to_timestamp wholestage on 1362 1366 6 0.7 1362.0 1.0X +to_timestamp wholestage off 1437 1439 3 0.7 1437.0 1.0X +to_timestamp wholestage on 1288 1292 5 0.8 1288.1 1.1X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_unix_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_unix_timestamp wholestage off 1343 1346 4 0.7 1342.6 1.0X -to_unix_timestamp wholestage on 1356 1359 2 0.7 1356.2 1.0X +to_unix_timestamp wholestage off 1352 1353 2 0.7 1352.0 1.0X +to_unix_timestamp wholestage on 1314 1319 5 0.8 1314.4 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to date str: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to date str wholestage off 227 230 4 4.4 227.0 1.0X -to date str wholestage on 299 302 3 3.3 299.0 0.8X +to date str wholestage off 211 215 6 4.7 210.7 1.0X +to date str wholestage on 217 217 1 4.6 216.5 1.0X OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_date wholestage off 3413 3440 38 0.3 3413.0 1.0X -to_date wholestage on 3392 3402 12 0.3 3392.3 1.0X +to_date wholestage off 3281 3295 20 0.3 3280.9 1.0X +to_date wholestage on 3223 3239 17 0.3 3222.8 1.0X ================================================================================================ @@ -444,14 +444,14 @@ OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-106 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -From java.sql.Date 410 415 7 12.2 82.0 1.0X -From java.time.LocalDate 332 333 1 15.1 66.4 1.2X -Collect java.sql.Date 1891 2542 829 2.6 378.1 0.2X -Collect java.time.LocalDate 1630 2138 441 3.1 326.0 0.3X -From java.sql.Timestamp 254 259 6 19.7 50.9 1.6X -From java.time.Instant 302 306 4 16.6 60.3 1.4X -Collect longs 1134 1265 117 4.4 226.8 0.4X -Collect java.sql.Timestamp 1441 1458 16 3.5 288.1 0.3X -Collect java.time.Instant 1680 1928 253 3.0 336.0 0.2X +From java.sql.Date 446 447 1 11.2 89.1 1.0X +From java.time.LocalDate 354 356 1 14.1 70.8 1.3X +Collect java.sql.Date 2722 3091 495 1.8 544.4 0.2X +Collect java.time.LocalDate 1786 1836 60 2.8 357.2 0.2X +From java.sql.Timestamp 275 287 19 18.2 55.0 1.6X +From java.time.Instant 325 328 3 15.4 65.0 1.4X +Collect longs 1300 1321 25 3.8 260.0 0.3X +Collect java.sql.Timestamp 1450 1557 102 3.4 290.0 0.3X +Collect java.time.Instant 1499 1599 87 3.3 299.9 0.3X diff --git a/sql/core/benchmarks/DateTimeBenchmark-results.txt b/sql/core/benchmarks/DateTimeBenchmark-results.txt index 7586295778bd8..7a9aa4badfeb7 100644 --- a/sql/core/benchmarks/DateTimeBenchmark-results.txt +++ b/sql/core/benchmarks/DateTimeBenchmark-results.txt @@ -6,18 +6,18 @@ OpenJDK 64-Bit Server VM 
1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz datetime +/- interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date + interval(m) 1638 1701 89 6.1 163.8 1.0X -date + interval(m, d) 1785 1790 7 5.6 178.5 0.9X -date + interval(m, d, ms) 6229 6270 58 1.6 622.9 0.3X -date - interval(m) 1500 1503 4 6.7 150.0 1.1X -date - interval(m, d) 1764 1766 3 5.7 176.4 0.9X -date - interval(m, d, ms) 6428 6446 25 1.6 642.8 0.3X -timestamp + interval(m) 2719 2722 4 3.7 271.9 0.6X -timestamp + interval(m, d) 3011 3021 14 3.3 301.1 0.5X -timestamp + interval(m, d, ms) 3405 3412 9 2.9 340.5 0.5X -timestamp - interval(m) 2759 2764 7 3.6 275.9 0.6X -timestamp - interval(m, d) 3094 3112 25 3.2 309.4 0.5X -timestamp - interval(m, d, ms) 3388 3392 5 3.0 338.8 0.5X +date + interval(m) 1555 1634 113 6.4 155.5 1.0X +date + interval(m, d) 1774 1797 33 5.6 177.4 0.9X +date + interval(m, d, ms) 6293 6335 59 1.6 629.3 0.2X +date - interval(m) 1461 1468 10 6.8 146.1 1.1X +date - interval(m, d) 1741 1741 0 5.7 174.1 0.9X +date - interval(m, d, ms) 6503 6518 21 1.5 650.3 0.2X +timestamp + interval(m) 2384 2385 1 4.2 238.4 0.7X +timestamp + interval(m, d) 2683 2684 2 3.7 268.3 0.6X +timestamp + interval(m, d, ms) 2987 3001 19 3.3 298.7 0.5X +timestamp - interval(m) 2391 2395 5 4.2 239.1 0.7X +timestamp - interval(m, d) 2674 2684 14 3.7 267.4 0.6X +timestamp - interval(m, d, ms) 3005 3007 3 3.3 300.5 0.5X ================================================================================================ @@ -28,92 +28,92 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz cast to timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -cast to timestamp wholestage off 319 323 6 31.4 31.9 1.0X -cast to timestamp wholestage on 304 311 8 32.9 30.4 1.0X +cast to timestamp wholestage off 313 320 10 31.9 31.3 1.0X +cast to timestamp wholestage on 325 341 18 30.8 32.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz year of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -year of timestamp wholestage off 1234 1239 6 8.1 123.4 1.0X -year of timestamp wholestage on 1229 1244 22 8.1 122.9 1.0X +year of timestamp wholestage off 1216 1216 1 8.2 121.6 1.0X +year of timestamp wholestage on 1226 1243 13 8.2 122.6 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz quarter of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -quarter of timestamp wholestage off 1440 1445 7 6.9 144.0 1.0X -quarter of timestamp wholestage on 1358 1361 3 7.4 135.8 1.1X +quarter of timestamp wholestage off 1417 1421 5 7.1 141.7 1.0X +quarter of timestamp wholestage on 1358 1365 8 7.4 135.8 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 
@ 2.50GHz month of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -month of timestamp wholestage off 1239 1240 1 8.1 123.9 1.0X -month of timestamp wholestage on 1221 1239 26 8.2 122.1 1.0X +month of timestamp wholestage off 1219 1220 1 8.2 121.9 1.0X +month of timestamp wholestage on 1222 1227 7 8.2 122.2 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz weekofyear of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -weekofyear of timestamp wholestage off 1926 1934 11 5.2 192.6 1.0X -weekofyear of timestamp wholestage on 1901 1911 10 5.3 190.1 1.0X +weekofyear of timestamp wholestage off 1950 1950 0 5.1 195.0 1.0X +weekofyear of timestamp wholestage on 1890 1899 8 5.3 189.0 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz day of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -day of timestamp wholestage off 1225 1229 6 8.2 122.5 1.0X -day of timestamp wholestage on 1217 1225 7 8.2 121.7 1.0X +day of timestamp wholestage off 1212 1213 2 8.3 121.2 1.0X +day of timestamp wholestage on 1216 1227 13 8.2 121.6 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz dayofyear of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -dayofyear of timestamp wholestage off 1290 1295 7 7.8 129.0 1.0X -dayofyear of timestamp wholestage on 1262 1270 7 7.9 126.2 1.0X +dayofyear of timestamp wholestage off 1282 1284 3 7.8 128.2 1.0X +dayofyear of timestamp wholestage on 1269 1274 5 7.9 126.9 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz dayofmonth of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -dayofmonth of timestamp wholestage off 1239 1239 1 8.1 123.9 1.0X -dayofmonth of timestamp wholestage on 1215 1222 8 8.2 121.5 1.0X +dayofmonth of timestamp wholestage off 1214 1219 7 8.2 121.4 1.0X +dayofmonth of timestamp wholestage on 1216 1224 6 8.2 121.6 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz dayofweek of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -dayofweek of timestamp wholestage off 1421 1422 2 7.0 142.1 1.0X -dayofweek of timestamp wholestage on 1379 1388 8 7.3 137.9 1.0X +dayofweek of timestamp wholestage off 1403 1430 39 7.1 140.3 1.0X +dayofweek of timestamp wholestage on 1378 1386 8 7.3 137.8 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU 
E5-2670 v2 @ 2.50GHz weekday of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -weekday of timestamp wholestage off 1349 1351 2 7.4 134.9 1.0X -weekday of timestamp wholestage on 1320 1327 8 7.6 132.0 1.0X +weekday of timestamp wholestage off 1344 1353 13 7.4 134.4 1.0X +weekday of timestamp wholestage on 1316 1322 5 7.6 131.6 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz hour of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -hour of timestamp wholestage off 1024 1024 0 9.8 102.4 1.0X -hour of timestamp wholestage on 921 929 11 10.9 92.1 1.1X +hour of timestamp wholestage off 992 1000 10 10.1 99.2 1.0X +hour of timestamp wholestage on 960 962 3 10.4 96.0 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz minute of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -minute of timestamp wholestage off 977 982 6 10.2 97.7 1.0X -minute of timestamp wholestage on 927 929 2 10.8 92.7 1.1X +minute of timestamp wholestage off 989 1000 16 10.1 98.9 1.0X +minute of timestamp wholestage on 965 974 13 10.4 96.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz second of timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -second of timestamp wholestage off 987 989 3 10.1 98.7 1.0X -second of timestamp wholestage on 923 926 5 10.8 92.3 1.1X +second of timestamp wholestage off 974 977 5 10.3 97.4 1.0X +second of timestamp wholestage on 959 966 8 10.4 95.9 1.0X ================================================================================================ @@ -124,15 +124,15 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz current_date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -current_date wholestage off 303 311 12 33.0 30.3 1.0X -current_date wholestage on 266 271 5 37.5 26.6 1.1X +current_date wholestage off 281 282 2 35.6 28.1 1.0X +current_date wholestage on 294 300 5 34.0 29.4 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz current_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -current_timestamp wholestage off 297 297 1 33.7 29.7 1.0X -current_timestamp wholestage on 264 272 7 37.8 26.4 1.1X +current_timestamp wholestage off 282 296 19 35.4 28.2 1.0X +current_timestamp wholestage on 304 331 31 32.9 30.4 0.9X ================================================================================================ @@ -143,43 
+143,43 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz cast to date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -cast to date wholestage off 1062 1063 2 9.4 106.2 1.0X -cast to date wholestage on 1007 1021 20 9.9 100.7 1.1X +cast to date wholestage off 1060 1061 1 9.4 106.0 1.0X +cast to date wholestage on 1021 1026 10 9.8 102.1 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz last_day: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -last_day wholestage off 1262 1265 5 7.9 126.2 1.0X -last_day wholestage on 1244 1256 14 8.0 124.4 1.0X +last_day wholestage off 1278 1280 3 7.8 127.8 1.0X +last_day wholestage on 1560 1566 6 6.4 156.0 0.8X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz next_day: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -next_day wholestage off 1119 1121 2 8.9 111.9 1.0X -next_day wholestage on 1057 1063 6 9.5 105.7 1.1X +next_day wholestage off 1091 1093 3 9.2 109.1 1.0X +next_day wholestage on 1070 1076 9 9.3 107.0 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_add: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_add wholestage off 1054 1059 7 9.5 105.4 1.0X -date_add wholestage on 1037 1069 52 9.6 103.7 1.0X +date_add wholestage off 1041 1047 8 9.6 104.1 1.0X +date_add wholestage on 1044 1050 4 9.6 104.4 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_sub: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_sub wholestage off 1054 1056 4 9.5 105.4 1.0X -date_sub wholestage on 1036 1040 4 9.7 103.6 1.0X +date_sub wholestage off 1038 1040 3 9.6 103.8 1.0X +date_sub wholestage on 1057 1061 4 9.5 105.7 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz add_months: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -add_months wholestage off 1408 1421 19 7.1 140.8 1.0X -add_months wholestage on 1434 1440 7 7.0 143.4 1.0X +add_months wholestage off 1401 1401 1 7.1 140.1 1.0X +add_months wholestage on 1438 1442 4 7.0 143.8 1.0X ================================================================================================ @@ -190,8 +190,8 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz format date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
------------------------------------------------------------------------------------------------------------------------ -format date wholestage off 5937 6169 328 1.7 593.7 1.0X -format date wholestage on 5836 5878 74 1.7 583.6 1.0X +format date wholestage off 5482 5803 454 1.8 548.2 1.0X +format date wholestage on 5502 5518 9 1.8 550.2 1.0X ================================================================================================ @@ -202,8 +202,8 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz from_unixtime: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -from_unixtime wholestage off 8904 8914 14 1.1 890.4 1.0X -from_unixtime wholestage on 8918 8936 13 1.1 891.8 1.0X +from_unixtime wholestage off 8538 8553 22 1.2 853.8 1.0X +from_unixtime wholestage on 8545 8552 6 1.2 854.5 1.0X ================================================================================================ @@ -214,15 +214,15 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz from_utc_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -from_utc_timestamp wholestage off 1110 1112 3 9.0 111.0 1.0X -from_utc_timestamp wholestage on 1115 1119 3 9.0 111.5 1.0X +from_utc_timestamp wholestage off 1094 1099 8 9.1 109.4 1.0X +from_utc_timestamp wholestage on 1109 1114 5 9.0 110.9 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_utc_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_utc_timestamp wholestage off 1524 1525 1 6.6 152.4 1.0X -to_utc_timestamp wholestage on 1450 1458 14 6.9 145.0 1.1X +to_utc_timestamp wholestage off 1466 1469 4 6.8 146.6 1.0X +to_utc_timestamp wholestage on 1401 1408 7 7.1 140.1 1.0X ================================================================================================ @@ -233,29 +233,29 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz cast interval: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -cast interval wholestage off 341 342 1 29.3 34.1 1.0X -cast interval wholestage on 285 294 7 35.1 28.5 1.2X +cast interval wholestage off 332 332 0 30.1 33.2 1.0X +cast interval wholestage on 315 324 10 31.7 31.5 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz datediff: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -datediff wholestage off 1874 1881 10 5.3 187.4 1.0X -datediff wholestage on 1785 1791 3 5.6 178.5 1.0X +datediff wholestage off 1796 1802 8 5.6 179.6 1.0X +datediff wholestage on 1758 1764 10 5.7 175.8 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 
4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz months_between: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -months_between wholestage off 5038 5042 5 2.0 503.8 1.0X -months_between wholestage on 4979 4987 8 2.0 497.9 1.0X +months_between wholestage off 4833 4836 4 2.1 483.3 1.0X +months_between wholestage on 4777 4780 2 2.1 477.7 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz window: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -window wholestage off 1716 1841 177 0.6 1716.2 1.0X -window wholestage on 46024 46063 27 0.0 46024.1 0.0X +window wholestage off 1812 1908 136 0.6 1811.7 1.0X +window wholestage on 46279 46376 74 0.0 46278.8 0.0X ================================================================================================ @@ -266,134 +266,134 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc YEAR: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc YEAR wholestage off 2428 2429 2 4.1 242.8 1.0X -date_trunc YEAR wholestage on 2451 2469 12 4.1 245.1 1.0X +date_trunc YEAR wholestage off 2367 2368 1 4.2 236.7 1.0X +date_trunc YEAR wholestage on 2321 2334 22 4.3 232.1 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc YYYY: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc YYYY wholestage off 2423 2426 3 4.1 242.3 1.0X -date_trunc YYYY wholestage on 2454 2462 8 4.1 245.4 1.0X +date_trunc YYYY wholestage off 2330 2334 5 4.3 233.0 1.0X +date_trunc YYYY wholestage on 2326 2332 5 4.3 232.6 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc YY: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc YY wholestage off 2421 2441 28 4.1 242.1 1.0X -date_trunc YY wholestage on 2453 2461 9 4.1 245.3 1.0X +date_trunc YY wholestage off 2334 2335 1 4.3 233.4 1.0X +date_trunc YY wholestage on 2315 2324 6 4.3 231.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MON: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MON wholestage off 2425 2427 3 4.1 242.5 1.0X -date_trunc MON wholestage on 2431 2438 9 4.1 243.1 1.0X +date_trunc MON wholestage off 2327 2330 4 4.3 232.7 1.0X +date_trunc MON wholestage on 2279 2289 12 4.4 227.9 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MONTH: Best Time(ms) Avg 
Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MONTH wholestage off 2427 2433 8 4.1 242.7 1.0X -date_trunc MONTH wholestage on 2429 2435 4 4.1 242.9 1.0X +date_trunc MONTH wholestage off 2330 2332 2 4.3 233.0 1.0X +date_trunc MONTH wholestage on 2277 2284 6 4.4 227.7 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MM: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MM wholestage off 2425 2431 9 4.1 242.5 1.0X -date_trunc MM wholestage on 2430 2435 4 4.1 243.0 1.0X +date_trunc MM wholestage off 2328 2329 2 4.3 232.8 1.0X +date_trunc MM wholestage on 2279 2284 4 4.4 227.9 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc DAY: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc DAY wholestage off 2117 2119 4 4.7 211.7 1.0X -date_trunc DAY wholestage on 2036 2118 174 4.9 203.6 1.0X +date_trunc DAY wholestage off 1974 1984 14 5.1 197.4 1.0X +date_trunc DAY wholestage on 1914 1922 7 5.2 191.4 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc DD: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc DD wholestage off 2116 2119 5 4.7 211.6 1.0X -date_trunc DD wholestage on 2035 2043 10 4.9 203.5 1.0X +date_trunc DD wholestage off 1967 1976 12 5.1 196.7 1.0X +date_trunc DD wholestage on 1913 1917 4 5.2 191.3 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc HOUR: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc HOUR wholestage off 2013 2014 2 5.0 201.3 1.0X -date_trunc HOUR wholestage on 2077 2088 13 4.8 207.7 1.0X +date_trunc HOUR wholestage off 1970 1970 0 5.1 197.0 1.0X +date_trunc HOUR wholestage on 1945 1946 2 5.1 194.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc MINUTE: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc MINUTE wholestage off 363 368 8 27.6 36.3 1.0X -date_trunc MINUTE wholestage on 321 326 7 31.2 32.1 1.1X +date_trunc MINUTE wholestage off 361 361 1 27.7 36.1 1.0X +date_trunc MINUTE wholestage on 331 336 4 30.2 33.1 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc SECOND: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ 
-date_trunc SECOND wholestage off 365 366 0 27.4 36.5 1.0X -date_trunc SECOND wholestage on 319 332 16 31.4 31.9 1.1X +date_trunc SECOND wholestage off 360 361 1 27.8 36.0 1.0X +date_trunc SECOND wholestage on 335 348 15 29.8 33.5 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc WEEK: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc WEEK wholestage off 2371 2376 7 4.2 237.1 1.0X -date_trunc WEEK wholestage on 2314 2322 8 4.3 231.4 1.0X +date_trunc WEEK wholestage off 2232 2236 6 4.5 223.2 1.0X +date_trunc WEEK wholestage on 2225 2232 6 4.5 222.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz date_trunc QUARTER: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -date_trunc QUARTER wholestage off 3334 3335 1 3.0 333.4 1.0X -date_trunc QUARTER wholestage on 3286 3291 7 3.0 328.6 1.0X +date_trunc QUARTER wholestage off 3083 3086 4 3.2 308.3 1.0X +date_trunc QUARTER wholestage on 3073 3086 16 3.3 307.3 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc year: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc year wholestage off 303 304 2 33.0 30.3 1.0X -trunc year wholestage on 283 291 5 35.3 28.3 1.1X +trunc year wholestage off 321 321 0 31.1 32.1 1.0X +trunc year wholestage on 299 303 5 33.5 29.9 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc yyyy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc yyyy wholestage off 324 330 8 30.9 32.4 1.0X -trunc yyyy wholestage on 283 291 9 35.3 28.3 1.1X +trunc yyyy wholestage off 323 327 5 30.9 32.3 1.0X +trunc yyyy wholestage on 299 302 3 33.4 29.9 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc yy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc yy wholestage off 304 305 3 32.9 30.4 1.0X -trunc yy wholestage on 283 302 28 35.3 28.3 1.1X +trunc yy wholestage off 315 315 1 31.8 31.5 1.0X +trunc yy wholestage on 299 304 4 33.4 29.9 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc mon: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc mon wholestage off 315 319 6 31.7 31.5 1.0X -trunc mon wholestage on 284 287 5 35.3 28.4 1.1X +trunc mon wholestage off 320 321 1 31.2 32.0 1.0X +trunc mon wholestage on 299 307 10 33.4 29.9 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on 
Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc month: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc month wholestage off 305 314 13 32.8 30.5 1.0X -trunc month wholestage on 283 292 14 35.3 28.3 1.1X +trunc month wholestage off 316 317 1 31.6 31.6 1.0X +trunc month wholestage on 299 302 5 33.5 29.9 1.1X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz trunc mm: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -trunc mm wholestage off 301 301 0 33.2 30.1 1.0X -trunc mm wholestage on 285 290 7 35.1 28.5 1.1X +trunc mm wholestage off 313 313 1 32.0 31.3 1.0X +trunc mm wholestage on 298 302 4 33.5 29.8 1.0X ================================================================================================ @@ -404,36 +404,36 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to timestamp str: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to timestamp str wholestage off 218 220 3 4.6 218.4 1.0X -to timestamp str wholestage on 213 216 6 4.7 212.5 1.0X +to timestamp str wholestage off 217 217 0 4.6 217.3 1.0X +to timestamp str wholestage on 209 212 2 4.8 209.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_timestamp wholestage off 1838 1842 5 0.5 1838.1 1.0X -to_timestamp wholestage on 1952 1971 11 0.5 1952.2 0.9X +to_timestamp wholestage off 1676 1677 2 0.6 1675.6 1.0X +to_timestamp wholestage on 1599 1606 8 0.6 1599.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_unix_timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to_unix_timestamp wholestage off 1987 1988 1 0.5 1986.9 1.0X -to_unix_timestamp wholestage on 1944 1948 3 0.5 1944.2 1.0X +to_unix_timestamp wholestage off 1582 1589 9 0.6 1582.1 1.0X +to_unix_timestamp wholestage on 1634 1637 3 0.6 1633.8 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to date str: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -to date str wholestage off 263 264 0 3.8 263.5 1.0X -to date str wholestage on 263 265 2 3.8 262.6 1.0X +to date str wholestage off 275 282 9 3.6 275.0 1.0X +to date str wholestage on 264 265 2 3.8 263.5 1.0X OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz to_date: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
------------------------------------------------------------------------------------------------------------------------ -to_date wholestage off 3560 3567 11 0.3 3559.7 1.0X -to_date wholestage on 3525 3534 10 0.3 3524.8 1.0X +to_date wholestage off 3170 3188 25 0.3 3170.1 1.0X +to_date wholestage on 3134 3143 10 0.3 3134.3 1.0X ================================================================================================ @@ -444,14 +444,14 @@ OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aw Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz To/from Java's date-time: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -From java.sql.Date 405 416 16 12.3 81.0 1.0X -From java.time.LocalDate 344 352 14 14.5 68.8 1.2X -Collect java.sql.Date 1622 2553 1372 3.1 324.4 0.2X -Collect java.time.LocalDate 1464 1482 20 3.4 292.8 0.3X -From java.sql.Timestamp 248 258 15 20.2 49.6 1.6X -From java.time.Instant 237 243 7 21.1 47.4 1.7X -Collect longs 1252 1341 109 4.0 250.5 0.3X -Collect java.sql.Timestamp 1515 1516 2 3.3 302.9 0.3X -Collect java.time.Instant 1379 1490 96 3.6 275.8 0.3X +From java.sql.Date 407 413 7 12.3 81.5 1.0X +From java.time.LocalDate 340 344 5 14.7 68.1 1.2X +Collect java.sql.Date 1700 2658 1422 2.9 340.0 0.2X +Collect java.time.LocalDate 1473 1494 30 3.4 294.6 0.3X +From java.sql.Timestamp 252 266 13 19.8 50.5 1.6X +From java.time.Instant 236 243 7 21.1 47.3 1.7X +Collect longs 1280 1337 79 3.9 256.1 0.3X +Collect java.sql.Timestamp 1485 1501 15 3.4 297.0 0.3X +Collect java.time.Instant 1441 1465 37 3.5 288.1 0.3X diff --git a/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt b/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt index 03bc334471e56..d0cd591da4c94 100644 --- a/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt +++ b/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt @@ -3,110 +3,110 @@ Benchmark for performance of JSON parsing ================================================================================================ Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 46010 46118 113 2.2 460.1 1.0X -UTF-8 is set 54407 55427 1718 1.8 544.1 0.8X +No encoding 68879 68993 116 1.5 688.8 1.0X +UTF-8 is set 115270 115602 455 0.9 1152.7 0.6X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 26614 28220 1461 3.8 266.1 1.0X -UTF-8 is set 42765 43400 550 2.3 427.6 0.6X +No encoding 47452 47538 113 2.1 474.5 1.0X +UTF-8 is set 77330 77354 30 1.3 773.3 0.6X Preparing data for benchmarking ... 
-Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 35696 35821 113 0.3 3569.6 1.0X -UTF-8 is set 55441 56176 1037 0.2 5544.1 0.6X +No encoding 60470 60900 534 0.2 6047.0 1.0X +UTF-8 is set 104733 104931 189 0.1 10473.3 0.6X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 61514 62968 NaN 0.0 123027.2 1.0X -UTF-8 is set 72096 72933 1162 0.0 144192.7 0.9X +No encoding 130302 131072 976 0.0 260604.6 1.0X +UTF-8 is set 150860 151284 377 0.0 301720.1 0.9X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Select 10 columns 9859 9913 79 1.0 985.9 1.0X -Select 1 column 10981 11003 36 0.9 1098.1 0.9X +Select 10 columns 18619 18684 99 0.5 1861.9 1.0X +Select 1 column 24227 24270 38 0.4 2422.7 0.8X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Short column without encoding 3555 3579 27 2.8 355.5 1.0X -Short column with UTF-8 5204 5227 35 1.9 520.4 0.7X -Wide column without encoding 60458 60637 164 0.2 6045.8 0.1X -Wide column with UTF-8 77544 78111 551 0.1 7754.4 0.0X +Short column without encoding 7947 7971 21 1.3 794.7 1.0X +Short column with UTF-8 12700 12753 58 0.8 1270.0 0.6X +Wide column without encoding 92632 92955 463 0.1 9263.2 0.1X +Wide column with UTF-8 147013 147170 188 0.1 14701.3 0.1X Preparing data for benchmarking ... 
-Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Text read 342 346 3 29.2 34.2 1.0X -from_json 7123 7318 179 1.4 712.3 0.0X -json_tuple 9843 9957 132 1.0 984.3 0.0X -get_json_object 7827 8046 194 1.3 782.7 0.0X +Text read 713 734 19 14.0 71.3 1.0X +from_json 22019 22429 456 0.5 2201.9 0.0X +json_tuple 27987 28047 74 0.4 2798.7 0.0X +get_json_object 21468 21870 350 0.5 2146.8 0.0X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Text read 1856 1884 32 26.9 37.1 1.0X -schema inferring 16734 16900 153 3.0 334.7 0.1X -parsing 14884 15203 470 3.4 297.7 0.1X +Text read 2887 2910 24 17.3 57.7 1.0X +schema inferring 31793 31843 43 1.6 635.9 0.1X +parsing 36791 37104 294 1.4 735.8 0.1X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Text read 5932 6148 228 8.4 118.6 1.0X -Schema inferring 20836 21938 1086 2.4 416.7 0.3X -Parsing without charset 18134 18661 457 2.8 362.7 0.3X -Parsing with UTF-8 27734 28069 378 1.8 554.7 0.2X +Text read 10570 10611 45 4.7 211.4 1.0X +Schema inferring 48729 48763 41 1.0 974.6 0.2X +Parsing without charset 35490 35648 141 1.4 709.8 0.3X +Parsing with UTF-8 63853 63994 163 0.8 1277.1 0.2X -Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Create a dataset of timestamps 889 914 28 11.2 88.9 1.0X -to_json(timestamp) 7920 8172 353 1.3 792.0 0.1X -write timestamps to files 6726 6822 129 1.5 672.6 0.1X -Create a dataset of dates 953 963 12 10.5 95.3 0.9X -to_json(date) 5370 5705 320 1.9 537.0 0.2X -write dates to files 4109 4166 52 2.4 410.9 0.2X +Create a dataset of timestamps 2187 2190 5 4.6 218.7 1.0X +to_json(timestamp) 16262 16503 323 0.6 1626.2 0.1X +write timestamps to files 11679 11692 12 0.9 1167.9 0.2X +Create a dataset of dates 2297 2310 12 4.4 229.7 1.0X +to_json(date) 10904 10956 46 0.9 1090.4 0.2X +write dates to files 6610 6645 35 1.5 661.0 0.3X -Java HotSpot(TM) 64-Bit Server VM 
11.0.5+10-LTS on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -read timestamp text from files 1614 1675 55 6.2 161.4 1.0X -read timestamps from files 16640 16858 209 0.6 1664.0 0.1X -infer timestamps from files 33239 33388 227 0.3 3323.9 0.0X -read date text from files 1310 1340 44 7.6 131.0 1.2X -read date from files 9470 9513 41 1.1 947.0 0.2X -timestamp strings 1303 1342 47 7.7 130.3 1.2X -parse timestamps from Dataset[String] 17650 18073 380 0.6 1765.0 0.1X -infer timestamps from Dataset[String] 32623 34065 1330 0.3 3262.3 0.0X -date strings 1864 1871 7 5.4 186.4 0.9X -parse dates from Dataset[String] 10914 11316 482 0.9 1091.4 0.1X -from_json(timestamp) 21102 21990 929 0.5 2110.2 0.1X -from_json(date) 15275 15961 598 0.7 1527.5 0.1X +read timestamp text from files 2524 2530 9 4.0 252.4 1.0X +read timestamps from files 41002 41052 59 0.2 4100.2 0.1X +infer timestamps from files 84621 84939 526 0.1 8462.1 0.0X +read date text from files 2292 2302 9 4.4 229.2 1.1X +read date from files 16954 16976 21 0.6 1695.4 0.1X +timestamp strings 3067 3077 13 3.3 306.7 0.8X +parse timestamps from Dataset[String] 48690 48971 243 0.2 4869.0 0.1X +infer timestamps from Dataset[String] 97463 97786 338 0.1 9746.3 0.0X +date strings 3952 3956 3 2.5 395.2 0.6X +parse dates from Dataset[String] 24210 24241 30 0.4 2421.0 0.1X +from_json(timestamp) 71710 72242 629 0.1 7171.0 0.0X +from_json(date) 42465 42481 13 0.2 4246.5 0.1X diff --git a/sql/core/benchmarks/JsonBenchmark-results.txt b/sql/core/benchmarks/JsonBenchmark-results.txt index 0f188c4cdea56..46d2410fb47c3 100644 --- a/sql/core/benchmarks/JsonBenchmark-results.txt +++ b/sql/core/benchmarks/JsonBenchmark-results.txt @@ -3,110 +3,110 @@ Benchmark for performance of JSON parsing ================================================================================================ Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz JSON schema inferring: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 38998 41002 NaN 2.6 390.0 1.0X -UTF-8 is set 61231 63282 1854 1.6 612.3 0.6X +No encoding 63981 64044 56 1.6 639.8 1.0X +UTF-8 is set 112672 113350 962 0.9 1126.7 0.6X Preparing data for benchmarking ... 
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz count a short column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 28272 28338 70 3.5 282.7 1.0X -UTF-8 is set 58681 62243 1517 1.7 586.8 0.5X +No encoding 51256 51449 180 2.0 512.6 1.0X +UTF-8 is set 83694 83859 148 1.2 836.9 0.6X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz count a wide column: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 44026 51829 1329 0.2 4402.6 1.0X -UTF-8 is set 65839 68596 500 0.2 6583.9 0.7X +No encoding 58440 59097 569 0.2 5844.0 1.0X +UTF-8 is set 102746 102883 198 0.1 10274.6 0.6X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz select wide row: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -No encoding 72144 74820 NaN 0.0 144287.6 1.0X -UTF-8 is set 69571 77888 NaN 0.0 139142.3 1.0X +No encoding 128982 129304 356 0.0 257965.0 1.0X +UTF-8 is set 147247 147415 231 0.0 294494.1 0.9X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Select a subset of 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Select 10 columns 9502 9604 106 1.1 950.2 1.0X -Select 1 column 11861 11948 109 0.8 1186.1 0.8X +Select 10 columns 18837 19048 331 0.5 1883.7 1.0X +Select 1 column 24707 24723 14 0.4 2470.7 0.8X Preparing data for benchmarking ... 
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz creation of JSON parser per line: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Short column without encoding 3830 3846 15 2.6 383.0 1.0X -Short column with UTF-8 5538 5543 7 1.8 553.8 0.7X -Wide column without encoding 66899 69158 NaN 0.1 6689.9 0.1X -Wide column with UTF-8 90052 93235 NaN 0.1 9005.2 0.0X +Short column without encoding 8218 8234 17 1.2 821.8 1.0X +Short column with UTF-8 12374 12438 107 0.8 1237.4 0.7X +Wide column without encoding 136918 137298 345 0.1 13691.8 0.1X +Wide column with UTF-8 176961 177142 257 0.1 17696.1 0.0X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz JSON functions: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Text read 659 674 13 15.2 65.9 1.0X -from_json 7676 7943 405 1.3 767.6 0.1X -json_tuple 9881 10172 273 1.0 988.1 0.1X -get_json_object 7949 8055 119 1.3 794.9 0.1X +Text read 1268 1278 12 7.9 126.8 1.0X +from_json 23348 23479 176 0.4 2334.8 0.1X +json_tuple 29606 30221 1024 0.3 2960.6 0.0X +get_json_object 21898 22148 226 0.5 2189.8 0.1X Preparing data for benchmarking ... -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Dataset of json strings: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Text read 3314 3326 17 15.1 66.3 1.0X -schema inferring 16549 17037 484 3.0 331.0 0.2X -parsing 15138 15283 172 3.3 302.8 0.2X +Text read 5887 5944 49 8.5 117.7 1.0X +schema inferring 46696 47054 312 1.1 933.9 0.1X +parsing 32336 32450 129 1.5 646.7 0.2X Preparing data for benchmarking ... 
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Json files in the per-line mode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Text read 5136 5446 268 9.7 102.7 1.0X -Schema inferring 19864 20568 1191 2.5 397.3 0.3X -Parsing without charset 17535 17888 329 2.9 350.7 0.3X -Parsing with UTF-8 25609 25758 218 2.0 512.2 0.2X +Text read 9756 9769 11 5.1 195.1 1.0X +Schema inferring 51318 51433 108 1.0 1026.4 0.2X +Parsing without charset 43609 43743 118 1.1 872.2 0.2X +Parsing with UTF-8 60775 60844 106 0.8 1215.5 0.2X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -Create a dataset of timestamps 784 790 7 12.8 78.4 1.0X -to_json(timestamp) 8005 8055 50 1.2 800.5 0.1X -write timestamps to files 6515 6559 45 1.5 651.5 0.1X -Create a dataset of dates 854 881 24 11.7 85.4 0.9X -to_json(date) 5187 5194 7 1.9 518.7 0.2X -write dates to files 3663 3684 22 2.7 366.3 0.2X +Create a dataset of timestamps 1998 2015 17 5.0 199.8 1.0X +to_json(timestamp) 18156 18317 263 0.6 1815.6 0.1X +write timestamps to files 12912 12917 5 0.8 1291.2 0.2X +Create a dataset of dates 2209 2270 53 4.5 220.9 0.9X +to_json(date) 9433 9489 90 1.1 943.3 0.2X +write dates to files 6915 6923 8 1.4 691.5 0.3X -Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4 -Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws +Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -read timestamp text from files 1297 1316 26 7.7 129.7 1.0X -read timestamps from files 16915 17723 963 0.6 1691.5 0.1X -infer timestamps from files 33967 34304 360 0.3 3396.7 0.0X -read date text from files 1095 1100 7 9.1 109.5 1.2X -read date from files 8376 8513 209 1.2 837.6 0.2X -timestamp strings 1807 1816 8 5.5 180.7 0.7X -parse timestamps from Dataset[String] 18189 18242 74 0.5 1818.9 0.1X -infer timestamps from Dataset[String] 37906 38547 571 0.3 3790.6 0.0X -date strings 2191 2194 4 4.6 219.1 0.6X -parse dates from Dataset[String] 11593 11625 33 0.9 1159.3 0.1X -from_json(timestamp) 22589 22650 101 0.4 2258.9 0.1X -from_json(date) 16479 16619 159 0.6 1647.9 0.1X +read timestamp text from files 2395 2412 17 4.2 239.5 1.0X +read timestamps from files 47269 47334 89 0.2 4726.9 0.1X +infer timestamps from files 91806 91851 67 0.1 9180.6 0.0X +read date text from files 2118 2133 13 4.7 211.8 1.1X +read date from files 17267 17340 115 0.6 1726.7 0.1X +timestamp strings 3906 3935 26 2.6 390.6 0.6X +parse timestamps from Dataset[String] 52244 52534 279 0.2 5224.4 0.0X +infer timestamps from Dataset[String] 100488 100714 198 0.1 10048.8 0.0X +date 
strings 4572 4584 12 2.2 457.2 0.5X +parse dates from Dataset[String] 26749 26768 17 0.4 2674.9 0.1X +from_json(timestamp) 71414 71867 556 0.1 7141.4 0.0X +from_json(date) 45322 45549 250 0.2 4532.2 0.1X diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala index 1df812d1aa809..89915d254883d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala @@ -22,6 +22,7 @@ import java.util.UUID import org.apache.hadoop.fs.Path +import org.apache.spark.internal.Logging import org.apache.spark.rdd.RDD import org.apache.spark.sql.{AnalysisException, Row, SparkSession} import org.apache.spark.sql.catalyst.{InternalRow, QueryPlanningTracker} @@ -50,7 +51,7 @@ import org.apache.spark.util.Utils class QueryExecution( val sparkSession: SparkSession, val logical: LogicalPlan, - val tracker: QueryPlanningTracker = new QueryPlanningTracker) { + val tracker: QueryPlanningTracker = new QueryPlanningTracker) extends Logging { // TODO: Move the planner an optimizer into here from SessionState. protected def planner = sparkSession.sessionState.planner @@ -133,26 +134,42 @@ class QueryExecution( tracker.measurePhase(phase)(block) } - def simpleString: String = simpleString(false) - - def simpleString(formatted: Boolean): String = withRedaction { + def simpleString: String = { val concat = new PlanStringConcat() - concat.append("== Physical Plan ==\n") + simpleString(false, SQLConf.get.maxToStringFields, concat.append) + withRedaction { + concat.toString + } + } + + private def simpleString( + formatted: Boolean, + maxFields: Int, + append: String => Unit): Unit = { + append("== Physical Plan ==\n") if (formatted) { try { - ExplainUtils.processPlan(executedPlan, concat.append) + ExplainUtils.processPlan(executedPlan, append) } catch { - case e: AnalysisException => concat.append(e.toString) - case e: IllegalArgumentException => concat.append(e.toString) + case e: AnalysisException => append(e.toString) + case e: IllegalArgumentException => append(e.toString) } } else { - QueryPlan.append(executedPlan, concat.append, verbose = false, addSuffix = false) + QueryPlan.append(executedPlan, + append, verbose = false, addSuffix = false, maxFields = maxFields) } - concat.append("\n") - concat.toString + append("\n") } def explainString(mode: ExplainMode): String = { + val concat = new PlanStringConcat() + explainString(mode, SQLConf.get.maxToStringFields, concat.append) + withRedaction { + concat.toString + } + } + + private def explainString(mode: ExplainMode, maxFields: Int, append: String => Unit): Unit = { val queryExecution = if (logical.isStreaming) { // This is used only by explaining `Dataset/DataFrame` created by `spark.readStream`, so the // output mode does not matter since there is no `Sink`. 
@@ -165,19 +182,19 @@ class QueryExecution( mode match { case SimpleMode => - queryExecution.simpleString + queryExecution.simpleString(false, maxFields, append) case ExtendedMode => - queryExecution.toString + queryExecution.toString(maxFields, append) case CodegenMode => try { - org.apache.spark.sql.execution.debug.codegenString(queryExecution.executedPlan) + org.apache.spark.sql.execution.debug.writeCodegen(append, queryExecution.executedPlan) } catch { - case e: AnalysisException => e.toString + case e: AnalysisException => append(e.toString) } case CostMode => - queryExecution.stringWithStats + queryExecution.stringWithStats(maxFields, append) case FormattedMode => - queryExecution.simpleString(formatted = true) + queryExecution.simpleString(formatted = true, maxFields = maxFields, append) } } @@ -204,27 +221,39 @@ class QueryExecution( override def toString: String = withRedaction { val concat = new PlanStringConcat() - writePlans(concat.append, SQLConf.get.maxToStringFields) - concat.toString + toString(SQLConf.get.maxToStringFields, concat.append) + withRedaction { + concat.toString + } + } + + private def toString(maxFields: Int, append: String => Unit): Unit = { + writePlans(append, maxFields) } - def stringWithStats: String = withRedaction { + def stringWithStats: String = { val concat = new PlanStringConcat() + stringWithStats(SQLConf.get.maxToStringFields, concat.append) + withRedaction { + concat.toString + } + } + + private def stringWithStats(maxFields: Int, append: String => Unit): Unit = { val maxFields = SQLConf.get.maxToStringFields // trigger to compute stats for logical plans try { optimizedPlan.stats } catch { - case e: AnalysisException => concat.append(e.toString + "\n") + case e: AnalysisException => append(e.toString + "\n") } // only show optimized logical plan and physical plan - concat.append("== Optimized Logical Plan ==\n") - QueryPlan.append(optimizedPlan, concat.append, verbose = true, addSuffix = true, maxFields) - concat.append("\n== Physical Plan ==\n") - QueryPlan.append(executedPlan, concat.append, verbose = true, addSuffix = false, maxFields) - concat.append("\n") - concat.toString + append("== Optimized Logical Plan ==\n") + QueryPlan.append(optimizedPlan, append, verbose = true, addSuffix = true, maxFields) + append("\n== Physical Plan ==\n") + QueryPlan.append(executedPlan, append, verbose = true, addSuffix = false, maxFields) + append("\n") } /** @@ -261,19 +290,26 @@ class QueryExecution( /** * Dumps debug information about query execution into the specified file. * + * @param path path of the file the debug info is written to. * @param maxFields maximum number of fields converted to string representation. + * @param explainMode the explain mode to be used to generate the string + * representation of the plan. 
*/ - def toFile(path: String, maxFields: Int = Int.MaxValue): Unit = { + def toFile( + path: String, + maxFields: Int = Int.MaxValue, + explainMode: Option[String] = None): Unit = { val filePath = new Path(path) val fs = filePath.getFileSystem(sparkSession.sessionState.newHadoopConf()) val writer = new BufferedWriter(new OutputStreamWriter(fs.create(filePath))) - val append = (s: String) => { - writer.write(s) - } try { - writePlans(append, maxFields) - writer.write("\n== Whole Stage Codegen ==\n") - org.apache.spark.sql.execution.debug.writeCodegen(writer.write, executedPlan) + val mode = explainMode.map(ExplainMode.fromString(_)).getOrElse(ExtendedMode) + explainString(mode, maxFields, writer.write) + if (mode != CodegenMode) { + writer.write("\n== Whole Stage Codegen ==\n") + org.apache.spark.sql.execution.debug.writeCodegen(writer.write, executedPlan) + } + log.info(s"Debug information was written at: $filePath") } finally { writer.close() } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 5481337bf6cee..0cca3e7b47c56 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -1306,7 +1306,7 @@ object functions { * @since 1.4.0 */ @scala.annotation.varargs - def struct(cols: Column*): Column = withExpr { CreateStruct(cols.map(_.expr)) } + def struct(cols: Column*): Column = withExpr { CreateStruct.create(cols.map(_.expr)) } /** * Creates a new struct column that composes multiple input columns. diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java index 5603cb988b8e7..af0a22b036030 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java @@ -18,6 +18,8 @@ package test.org.apache.spark.sql; import java.io.Serializable; +import java.sql.Timestamp; +import java.text.SimpleDateFormat; import java.time.Instant; import java.time.LocalDate; import java.util.*; @@ -210,6 +212,17 @@ private static Row createRecordSpark22000Row(Long index) { return new GenericRow(values); } + private static String timestampToString(Timestamp ts) { + String timestampString = String.valueOf(ts); + String formatted = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(ts); + + if (timestampString.length() > 19 && !timestampString.substring(19).equals(".0")) { + return formatted + timestampString.substring(19); + } else { + return formatted; + } + } + private static RecordSpark22000 createRecordSpark22000(Row recordRow) { RecordSpark22000 record = new RecordSpark22000(); record.setShortField(String.valueOf(recordRow.getShort(0))); @@ -219,7 +232,7 @@ private static RecordSpark22000 createRecordSpark22000(Row recordRow) { record.setDoubleField(String.valueOf(recordRow.getDouble(4))); record.setStringField(recordRow.getString(5)); record.setBooleanField(String.valueOf(recordRow.getBoolean(6))); - record.setTimestampField(String.valueOf(recordRow.getTimestamp(7))); + record.setTimestampField(timestampToString(recordRow.getTimestamp(7))); // This would figure out that null value will not become "null". 
record.setNullIntField(null); return record; diff --git a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md index a4f076396c517..d245aa5a17345 100644 --- a/sql/core/src/test/resources/sql-functions/sql-expression-schema.md +++ b/sql/core/src/test/resources/sql-functions/sql-expression-schema.md @@ -1,8 +1,8 @@ ## Summary - - Number of queries: 336 + - Number of queries: 337 - Number of expressions that missing example: 34 - - Expressions missing examples: and,string,tinyint,double,smallint,date,decimal,boolean,float,binary,bigint,int,timestamp,cume_dist,dense_rank,input_file_block_length,input_file_block_start,input_file_name,lag,lead,monotonically_increasing_id,ntile,struct,!,not,or,percent_rank,rank,row_number,spark_partition_id,version,window,positive,count_min_sketch + - Expressions missing examples: and,string,tinyint,double,smallint,date,decimal,boolean,float,binary,bigint,int,timestamp,struct,cume_dist,dense_rank,input_file_block_length,input_file_block_start,input_file_name,lag,lead,monotonically_increasing_id,ntile,!,not,or,percent_rank,rank,row_number,spark_partition_id,version,window,positive,count_min_sketch ## Schema of Built-in Functions | Class name | Function name or alias | Query example | Output schema | | ---------- | ---------------------- | ------------- | ------------- | @@ -79,9 +79,11 @@ | org.apache.spark.sql.catalyst.expressions.CreateArray | array | SELECT array(1, 2, 3) | struct> | | org.apache.spark.sql.catalyst.expressions.CreateMap | map | SELECT map(1.0, '2', 3.0, '4') | struct> | | org.apache.spark.sql.catalyst.expressions.CreateNamedStruct | named_struct | SELECT named_struct("a", 1, "b", 2, "c", 3) | struct> | +| org.apache.spark.sql.catalyst.expressions.CreateNamedStruct | struct | N/A | N/A | | org.apache.spark.sql.catalyst.expressions.CsvToStructs | from_csv | SELECT from_csv('1, 0.8', 'a INT, b DOUBLE') | struct> | | org.apache.spark.sql.catalyst.expressions.Cube | cube | SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY cube(name, age) | struct | | org.apache.spark.sql.catalyst.expressions.CumeDist | cume_dist | N/A | N/A | +| org.apache.spark.sql.catalyst.expressions.CurrentCatalog | current_catalog | SELECT current_catalog() | struct | | org.apache.spark.sql.catalyst.expressions.CurrentDatabase | current_database | SELECT current_database() | struct | | org.apache.spark.sql.catalyst.expressions.CurrentDate | current_date | SELECT current_date() | struct | | org.apache.spark.sql.catalyst.expressions.CurrentTimestamp | current_timestamp | SELECT current_timestamp() | struct | @@ -156,12 +158,12 @@ | org.apache.spark.sql.catalyst.expressions.LessThanOrEqual | <= | SELECT 2 <= 2 | struct<(2 <= 2):boolean> | | org.apache.spark.sql.catalyst.expressions.Levenshtein | levenshtein | SELECT levenshtein('kitten', 'sitting') | struct | | org.apache.spark.sql.catalyst.expressions.Like | like | SELECT like('Spark', '_park') | struct | -| org.apache.spark.sql.catalyst.expressions.Log | ln | SELECT ln(1) | struct | +| org.apache.spark.sql.catalyst.expressions.Log | ln | SELECT ln(1) | struct | | org.apache.spark.sql.catalyst.expressions.Log10 | log10 | SELECT log10(10) | struct | | org.apache.spark.sql.catalyst.expressions.Log1p | log1p | SELECT log1p(0) | struct | | org.apache.spark.sql.catalyst.expressions.Log2 | log2 | SELECT log2(2) | struct | | org.apache.spark.sql.catalyst.expressions.Logarithm | log | SELECT log(10, 100) | struct | 
-| org.apache.spark.sql.catalyst.expressions.Lower | lcase | SELECT lcase('SparkSql') | struct | +| org.apache.spark.sql.catalyst.expressions.Lower | lcase | SELECT lcase('SparkSql') | struct | | org.apache.spark.sql.catalyst.expressions.Lower | lower | SELECT lower('SparkSql') | struct | | org.apache.spark.sql.catalyst.expressions.MakeDate | make_date | SELECT make_date(2013, 7, 15) | struct | | org.apache.spark.sql.catalyst.expressions.MakeInterval | make_interval | SELECT make_interval(100, 11, 1, 1, 12, 30, 01.001001) | struct | @@ -170,7 +172,7 @@ | org.apache.spark.sql.catalyst.expressions.MapEntries | map_entries | SELECT map_entries(map(1, 'a', 2, 'b')) | struct>> | | org.apache.spark.sql.catalyst.expressions.MapFilter | map_filter | SELECT map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v) | struct namedlambdavariable()), namedlambdavariable(), namedlambdavariable())):map> | | org.apache.spark.sql.catalyst.expressions.MapFromArrays | map_from_arrays | SELECT map_from_arrays(array(1.0, 3.0), array('2', '4')) | struct> | -| org.apache.spark.sql.catalyst.expressions.MapFromEntries | map_from_entries | SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b'))) | struct> | +| org.apache.spark.sql.catalyst.expressions.MapFromEntries | map_from_entries | SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b'))) | struct> | | org.apache.spark.sql.catalyst.expressions.MapKeys | map_keys | SELECT map_keys(map(1, 'a', 2, 'b')) | struct> | | org.apache.spark.sql.catalyst.expressions.MapValues | map_values | SELECT map_values(map(1, 'a', 2, 'b')) | struct> | | org.apache.spark.sql.catalyst.expressions.MapZipWith | map_zip_with | SELECT map_zip_with(map(1, 'a', 2, 'b'), map(1, 'x', 2, 'y'), (k, v1, v2) -> concat(v1, v2)) | struct> | @@ -185,7 +187,6 @@ | org.apache.spark.sql.catalyst.expressions.Murmur3Hash | hash | SELECT hash('Spark', array(123), 2) | struct | | org.apache.spark.sql.catalyst.expressions.NTile | ntile | N/A | N/A | | org.apache.spark.sql.catalyst.expressions.NaNvl | nanvl | SELECT nanvl(cast('NaN' as double), 123) | struct | -| org.apache.spark.sql.catalyst.expressions.NamedStruct | struct | N/A | N/A | | org.apache.spark.sql.catalyst.expressions.NextDay | next_day | SELECT next_day('2015-01-14', 'TU') | struct | | org.apache.spark.sql.catalyst.expressions.Not | ! 
| N/A | N/A | | org.apache.spark.sql.catalyst.expressions.Not | not | N/A | N/A | @@ -218,7 +219,7 @@ | org.apache.spark.sql.catalyst.expressions.Remainder | mod | SELECT 2 % 1.8 | struct<(CAST(CAST(2 AS DECIMAL(1,0)) AS DECIMAL(2,1)) % CAST(1.8 AS DECIMAL(2,1))):decimal(2,1)> | | org.apache.spark.sql.catalyst.expressions.Reverse | reverse | SELECT reverse('Spark SQL') | struct | | org.apache.spark.sql.catalyst.expressions.Right | right | SELECT right('Spark SQL', 3) | struct | -| org.apache.spark.sql.catalyst.expressions.Rint | rint | SELECT rint(12.3456) | struct | +| org.apache.spark.sql.catalyst.expressions.Rint | rint | SELECT rint(12.3456) | struct | | org.apache.spark.sql.catalyst.expressions.Rollup | rollup | SELECT name, age, count(*) FROM VALUES (2, 'Alice'), (5, 'Bob') people(age, name) GROUP BY rollup(name, age) | struct | | org.apache.spark.sql.catalyst.expressions.Round | round | SELECT round(2.5, 0) | struct | | org.apache.spark.sql.catalyst.expressions.RowNumber | row_number | N/A | N/A | @@ -250,7 +251,7 @@ | org.apache.spark.sql.catalyst.expressions.Stack | stack | SELECT stack(2, 1, 2, 3) | struct | | org.apache.spark.sql.catalyst.expressions.StringInstr | instr | SELECT instr('SparkSQL', 'SQL') | struct | | org.apache.spark.sql.catalyst.expressions.StringLPad | lpad | SELECT lpad('hi', 5, '??') | struct | -| org.apache.spark.sql.catalyst.expressions.StringLocate | position | SELECT position('bar', 'foobarbar') | struct | +| org.apache.spark.sql.catalyst.expressions.StringLocate | position | SELECT position('bar', 'foobarbar') | struct | | org.apache.spark.sql.catalyst.expressions.StringLocate | locate | SELECT locate('bar', 'foobarbar') | struct | | org.apache.spark.sql.catalyst.expressions.StringRPad | rpad | SELECT rpad('hi', 5, '??') | struct | | org.apache.spark.sql.catalyst.expressions.StringRepeat | repeat | SELECT repeat('123', 2) | struct | diff --git a/sql/core/src/test/resources/sql-tests/inputs/current_database_catalog.sql b/sql/core/src/test/resources/sql-tests/inputs/current_database_catalog.sql new file mode 100644 index 0000000000000..4406f1bc2e6e3 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/current_database_catalog.sql @@ -0,0 +1,2 @@ +-- get current_database and current_catalog +select current_database(), current_catalog(); diff --git a/sql/core/src/test/resources/sql-tests/inputs/datetime-legacy.sql b/sql/core/src/test/resources/sql-tests/inputs/datetime-legacy.sql new file mode 100644 index 0000000000000..daec2b40a620b --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/inputs/datetime-legacy.sql @@ -0,0 +1,2 @@ +--SET spark.sql.legacy.timeParserPolicy=LEGACY +--IMPORT datetime.sql diff --git a/sql/core/src/test/resources/sql-tests/inputs/datetime.sql b/sql/core/src/test/resources/sql-tests/inputs/datetime.sql index 0fb373f419e7e..663c62f1a6f66 100644 --- a/sql/core/src/test/resources/sql-tests/inputs/datetime.sql +++ b/sql/core/src/test/resources/sql-tests/inputs/datetime.sql @@ -140,3 +140,23 @@ select to_date("16", "dd"); select to_date("02-29", "MM-dd"); select to_timestamp("2019 40", "yyyy mm"); select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss"); + +-- Unsupported narrow text style +select date_format(date '2020-05-23', 'GGGGG'); +select date_format(date '2020-05-23', 'MMMMM'); +select date_format(date '2020-05-23', 'LLLLL'); +select date_format(timestamp '2020-05-23', 'EEEEE'); +select date_format(timestamp '2020-05-23', 'uuuuu'); +select date_format('2020-05-23', 'QQQQQ'); +select date_format('2020-05-23',
'qqqqq'); +select to_timestamp('2019-10-06 A', 'yyyy-MM-dd GGGGG'); +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE'); +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE'); +select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE'); +select from_unixtime(12345, 'MMMMM'); +select from_unixtime(54321, 'QQQQQ'); +select from_unixtime(23456, 'aaaaa'); +select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')); +select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')); +select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')); +select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')); diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out index 2e61cb8cb8c3f..5857a0ac90c70 100644 --- a/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out @@ -1,5 +1,5 @@ -- Automatically generated by SQLQueryTestSuite --- Number of queries: 92 +-- Number of queries: 116 -- !query @@ -838,3 +838,164 @@ select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss") struct -- !query output 2019-01-01 10:10:10 + + +-- !query +select date_format(date '2020-05-23', 'GGGGG') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(date '2020-05-23', 'MMMMM') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMMMM' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(date '2020-05-23', 'LLLLL') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'LLLLL' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(timestamp '2020-05-23', 'EEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(timestamp '2020-05-23', 'uuuuu') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'uuuuu' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format('2020-05-23', 'QQQQQ') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Too many pattern letters: Q + + +-- !query +select date_format('2020-05-23', 'qqqqq') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Too many pattern letters: q + + +-- !query +select to_timestamp('2019-10-06 A', 'yyyy-MM-dd GGGGG') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd MM yyyy EEEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd MM yyyy EEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd MM yyyy EEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_unixtime(12345, 'MMMMM') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMMMM' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_unixtime(54321, 'QQQQQ') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select from_unixtime(23456, 'aaaaa') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'aaaaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html diff --git a/sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out index b507713a73d1f..d5c0acb40bb1e 100644 --- a/sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out @@ -55,7 +55,7 @@ struct -- !query select position('bar' in 'foobarbar'), position(null, 'foobarbar'), position('aaads', null) -- !query schema -struct +struct -- !query output 4 NULL NULL diff --git a/sql/core/src/test/resources/sql-tests/results/current_database_catalog.sql.out b/sql/core/src/test/resources/sql-tests/results/current_database_catalog.sql.out new file mode 100644 index 0000000000000..b714463a0aa0c --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/results/current_database_catalog.sql.out @@ -0,0 +1,10 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 1 + + +-- !query +select current_database(), current_catalog() +-- !query schema +struct +-- !query output +default spark_catalog diff --git a/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out b/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out new file mode 100644 index 0000000000000..8a726efafad89 --- /dev/null +++ b/sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out @@ -0,0 +1,958 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 116 + + +-- !query +select TIMESTAMP_SECONDS(1230219000),TIMESTAMP_SECONDS(-1230219000),TIMESTAMP_SECONDS(null) +-- !query schema +struct +-- !query output +2008-12-25 07:30:00 1931-01-07 00:30:00 NULL + + +-- !query +select TIMESTAMP_MILLIS(1230219000123),TIMESTAMP_MILLIS(-1230219000123),TIMESTAMP_MILLIS(null) +-- !query schema +struct +-- !query output +2008-12-25 07:30:00.123 1931-01-07 00:29:59.877 NULL + + +-- !query +select TIMESTAMP_MICROS(1230219000123123),TIMESTAMP_MICROS(-1230219000123123),TIMESTAMP_MICROS(null) +-- !query schema +struct +-- !query output +2008-12-25 07:30:00.123123 1931-01-07 00:29:59.876877 NULL + + +-- !query +select TIMESTAMP_SECONDS(1230219000123123) +-- !query schema +struct<> +-- !query output +java.lang.ArithmeticException +long overflow + + +-- !query +select TIMESTAMP_SECONDS(-1230219000123123) +-- !query schema +struct<> +-- !query output +java.lang.ArithmeticException +long overflow + + +-- !query +select TIMESTAMP_MILLIS(92233720368547758) +-- !query schema +struct<> +-- !query output +java.lang.ArithmeticException +long overflow + + +-- !query +select TIMESTAMP_MILLIS(-92233720368547758) +-- !query schema +struct<> +-- !query output +java.lang.ArithmeticException +long overflow + + +-- !query +select current_date = current_date(), current_timestamp = current_timestamp() +-- !query schema +struct<(current_date() = current_date()):boolean,(current_timestamp() = current_timestamp()):boolean> +-- !query output +true true + + +-- !query +select to_date(null), to_date('2016-12-31'), to_date('2016-12-31', 'yyyy-MM-dd') +-- !query schema +struct +-- !query output +NULL 2016-12-31 2016-12-31 + + +-- !query +select to_timestamp(null), to_timestamp('2016-12-31 00:12:00'), to_timestamp('2016-12-31', 'yyyy-MM-dd') +-- !query schema +struct +-- !query output +NULL 2016-12-31 00:12:00 2016-12-31 00:00:00 + + +-- !query +select dayofweek('2007-02-03'), 
dayofweek('2009-07-30'), dayofweek('2017-05-27'), dayofweek(null), dayofweek('1582-10-15 13:10:15') +-- !query schema +struct +-- !query output +7 5 7 NULL 6 + + +-- !query +create temporary view ttf1 as select * from values + (1, 2), + (2, 3) + as ttf1(current_date, current_timestamp) +-- !query schema +struct<> +-- !query output + + + +-- !query +select current_date, current_timestamp from ttf1 +-- !query schema +struct +-- !query output +1 2 +2 3 + + +-- !query +create temporary view ttf2 as select * from values + (1, 2), + (2, 3) + as ttf2(a, b) +-- !query schema +struct<> +-- !query output + + + +-- !query +select current_date = current_date(), current_timestamp = current_timestamp(), a, b from ttf2 +-- !query schema +struct<(current_date() = current_date()):boolean,(current_timestamp() = current_timestamp()):boolean,a:int,b:int> +-- !query output +true true 1 2 +true true 2 3 + + +-- !query +select a, b from ttf2 order by a, current_date +-- !query schema +struct +-- !query output +1 2 +2 3 + + +-- !query +select weekday('2007-02-03'), weekday('2009-07-30'), weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15') +-- !query schema +struct +-- !query output +5 3 5 NULL 4 + + +-- !query +select year('1500-01-01'), month('1500-01-01'), dayOfYear('1500-01-01') +-- !query schema +struct +-- !query output +1500 1 1 + + +-- !query +select date '2019-01-01\t' +-- !query schema +struct +-- !query output +2019-01-01 + + +-- !query +select timestamp '2019-01-01\t' +-- !query schema +struct +-- !query output +2019-01-01 00:00:00 + + +-- !query +select timestamp'2011-11-11 11:11:11' + interval '2' day +-- !query schema +struct +-- !query output +2011-11-13 11:11:11 + + +-- !query +select timestamp'2011-11-11 11:11:11' - interval '2' day +-- !query schema +struct +-- !query output +2011-11-09 11:11:11 + + +-- !query +select date'2011-11-11 11:11:11' + interval '2' second +-- !query schema +struct +-- !query output +2011-11-11 + + +-- !query +select date'2011-11-11 11:11:11' - interval '2' second +-- !query schema +struct +-- !query output +2011-11-10 + + +-- !query +select '2011-11-11' - interval '2' day +-- !query schema +struct +-- !query output +2011-11-09 00:00:00 + + +-- !query +select '2011-11-11 11:11:11' - interval '2' second +-- !query schema +struct +-- !query output +2011-11-11 11:11:09 + + +-- !query +select '1' - interval '2' second +-- !query schema +struct +-- !query output +NULL + + +-- !query +select 1 - interval '2' second +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve '1 + (- INTERVAL '2 seconds')' due to data type mismatch: argument 1 requires timestamp type, however, '1' is of int type.; line 1 pos 7 + + +-- !query +select date'2020-01-01' - timestamp'2019-10-06 10:11:12.345678' +-- !query schema +struct +-- !query output +2078 hours 48 minutes 47.654322 seconds + + +-- !query +select timestamp'2019-10-06 10:11:12.345678' - date'2020-01-01' +-- !query schema +struct +-- !query output +-2078 hours -48 minutes -47.654322 seconds + + +-- !query +select timestamp'2019-10-06 10:11:12.345678' - null +-- !query schema +struct +-- !query output +NULL + + +-- !query +select null - timestamp'2019-10-06 10:11:12.345678' +-- !query schema +struct +-- !query output +NULL + + +-- !query +select date_add('2011-11-11', 1Y) +-- !query schema +struct +-- !query output +2011-11-12 + + +-- !query +select date_add('2011-11-11', 1S) +-- !query schema +struct +-- !query output +2011-11-12 + + +-- !query +select 
date_add('2011-11-11', 1) +-- !query schema +struct +-- !query output +2011-11-12 + + +-- !query +select date_add('2011-11-11', 1L) +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_add(CAST('2011-11-11' AS DATE), 1L)' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, '1L' is of bigint type.; line 1 pos 7 + + +-- !query +select date_add('2011-11-11', 1.0) +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_add(CAST('2011-11-11' AS DATE), 1.0BD)' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, '1.0BD' is of decimal(2,1) type.; line 1 pos 7 + + +-- !query +select date_add('2011-11-11', 1E1) +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_add(CAST('2011-11-11' AS DATE), 10.0D)' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, '10.0D' is of double type.; line 1 pos 7 + + +-- !query +select date_add('2011-11-11', '1') +-- !query schema +struct +-- !query output +2011-11-12 + + +-- !query +select date_add('2011-11-11', '1.2') +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +The second argument of 'date_add' function needs to be an integer.; + + +-- !query +select date_add(date'2011-11-11', 1) +-- !query schema +struct +-- !query output +2011-11-12 + + +-- !query +select date_add(timestamp'2011-11-11', 1) +-- !query schema +struct +-- !query output +2011-11-12 + + +-- !query +select date_sub(date'2011-11-11', 1) +-- !query schema +struct +-- !query output +2011-11-10 + + +-- !query +select date_sub(date'2011-11-11', '1') +-- !query schema +struct +-- !query output +2011-11-10 + + +-- !query +select date_sub(date'2011-11-11', '1.2') +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +The second argument of 'date_sub' function needs to be an integer.; + + +-- !query +select date_sub(timestamp'2011-11-11', 1) +-- !query schema +struct +-- !query output +2011-11-10 + + +-- !query +select date_sub(null, 1) +-- !query schema +struct +-- !query output +NULL + + +-- !query +select date_sub(date'2011-11-11', null) +-- !query schema +struct +-- !query output +NULL + + +-- !query +select date'2011-11-11' + 1E1 +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_add(DATE '2011-11-11', 10.0D)' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, '10.0D' is of double type.; line 1 pos 7 + + +-- !query +select date'2011-11-11' + '1' +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_add(DATE '2011-11-11', CAST('1' AS DOUBLE))' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, 'CAST('1' AS DOUBLE)' is of double type.; line 1 pos 7 + + +-- !query +select null + date '2001-09-28' +-- !query schema +struct +-- !query output +NULL + + +-- !query +select date '2001-09-28' + 7Y +-- !query schema +struct +-- !query output +2001-10-05 + + +-- !query +select 7S + date '2001-09-28' +-- !query schema +struct +-- !query output +2001-10-05 + + +-- !query +select date '2001-10-01' - 7 +-- !query schema +struct +-- !query output +2001-09-24 + + +-- !query +select date '2001-10-01' - '7' +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException 
+cannot resolve 'date_sub(DATE '2001-10-01', CAST('7' AS DOUBLE))' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, 'CAST('7' AS DOUBLE)' is of double type.; line 1 pos 7 + + +-- !query +select date '2001-09-28' + null +-- !query schema +struct +-- !query output +NULL + + +-- !query +select date '2001-09-28' - null +-- !query schema +struct +-- !query output +NULL + + +-- !query +create temp view v as select '1' str +-- !query schema +struct<> +-- !query output + + + +-- !query +select date_add('2011-11-11', str) from v +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_add(CAST('2011-11-11' AS DATE), v.`str`)' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, 'v.`str`' is of string type.; line 1 pos 7 + + +-- !query +select date_sub('2011-11-11', str) from v +-- !query schema +struct<> +-- !query output +org.apache.spark.sql.AnalysisException +cannot resolve 'date_sub(CAST('2011-11-11' AS DATE), v.`str`)' due to data type mismatch: argument 2 requires (int or smallint or tinyint) type, however, 'v.`str`' is of string type.; line 1 pos 7 + + +-- !query +select null - date '2019-10-06' +-- !query schema +struct +-- !query output +NULL + + +-- !query +select date '2001-10-01' - date '2001-09-28' +-- !query schema +struct +-- !query output +3 days + + +-- !query +select to_timestamp('2019-10-06 10:11:12.', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.0', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.1', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.12', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.123UTC', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.1234', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.12345CST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.123456PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('123456 2019-10-06 10:11:12.123456PST', 'SSSSSS yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('223456 2019-10-06 10:11:12.123456PST', 'SSSSSS yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.1234', 'yyyy-MM-dd HH:mm:ss.[SSSSSS]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12.123', 'yyyy-MM-dd HH:mm:ss[.SSSSSS]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11:12', 'yyyy-MM-dd HH:mm:ss[.SSSSSS]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select 
to_timestamp('2019-10-06 10:11:12.12', 'yyyy-MM-dd HH:mm[:ss.SSSSSS]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('2019-10-06 10:11', 'yyyy-MM-dd HH:mm[:ss.SSSSSS]') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("12.12342019-10-06S10:11", "ss.SSSSyyyy-MM-dd'S'HH:mm") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("12.1232019-10-06S10:11", "ss.SSSSyyyy-MM-dd'S'HH:mm") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("12.1232019-10-06S10:11", "ss.SSSSyy-MM-dd'S'HH:mm") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("12.1234019-10-06S10:11", "ss.SSSSy-MM-dd'S'HH:mm") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'") +-- !query schema +struct +-- !query output +2019-10-06 00:00:00 + + +-- !query +select to_timestamp("S2019-10-06", "'S'yyyy-MM-dd") +-- !query schema +struct +-- !query output +2019-10-06 00:00:00 + + +-- !query +select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Illegal pattern character 'e' + + +-- !query +select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uucc') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Illegal pattern character 'c' + + +-- !query +select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu') +-- !query schema +struct +-- !query output +2019-10-06 0007 + + +-- !query +select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS") +-- !query schema +struct +-- !query output +2019-10-06 10:11:12.012 + + +-- !query +select to_timestamp("2019-10-06T10:11:12'", "yyyy-MM-dd'T'HH:mm:ss''") +-- !query schema +struct +-- !query output +2019-10-06 10:11:12 + + +-- !query +select to_timestamp("'2019-10-06T10:11:12", "''yyyy-MM-dd'T'HH:mm:ss") +-- !query schema +struct +-- !query output +2019-10-06 10:11:12 + + +-- !query +select to_timestamp("P2019-10-06T10:11:12", "'P'yyyy-MM-dd'T'HH:mm:ss") +-- !query schema +struct +-- !query output +2019-10-06 10:11:12 + + +-- !query +select to_timestamp("16", "dd") +-- !query schema +struct +-- !query output +1970-01-16 00:00:00 + + +-- !query +select to_timestamp("02-29", "MM-dd") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_date("16", "dd") +-- !query schema +struct +-- !query output +1970-01-16 + + +-- !query +select to_date("02-29", "MM-dd") +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp("2019 40", "yyyy mm") +-- !query schema +struct +-- !query output +2019-01-01 00:40:00 + + +-- !query +select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss") +-- !query schema +struct +-- !query output +2019-01-01 10:10:10 + + +-- !query +select date_format(date '2020-05-23', 'GGGGG') +-- !query schema +struct +-- !query output +AD + + +-- !query +select date_format(date '2020-05-23', 'MMMMM') +-- !query schema +struct +-- !query output +May + + +-- !query +select date_format(date '2020-05-23', 'LLLLL') +-- !query schema +struct +-- !query output +May + + +-- !query +select date_format(timestamp '2020-05-23', 'EEEEE') +-- !query schema +struct +-- !query output +Saturday + + +-- !query +select date_format(timestamp 
'2020-05-23', 'uuuuu') +-- !query schema +struct +-- !query output +00006 + + +-- !query +select date_format('2020-05-23', 'QQQQQ') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Illegal pattern character 'Q' + + +-- !query +select date_format('2020-05-23', 'qqqqq') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Illegal pattern character 'q' + + +-- !query +select to_timestamp('2019-10-06 A', 'yyyy-MM-dd GGGGG') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE') +-- !query schema +struct +-- !query output +2020-05-22 00:00:00 + + +-- !query +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE') +-- !query schema +struct +-- !query output +2020-05-22 00:00:00 + + +-- !query +select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE') +-- !query schema +struct +-- !query output +1590130800 + + +-- !query +select from_unixtime(12345, 'MMMMM') +-- !query schema +struct +-- !query output +December + + +-- !query +select from_unixtime(54321, 'QQQQQ') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select from_unixtime(23456, 'aaaaa') +-- !query schema +struct +-- !query output +PM + + +-- !query +select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct> +-- !query output +{"time":2015-10-26 00:00:00} + + +-- !query +select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct> +-- !query output +{"date":2015-10-26} + + +-- !query +select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct> +-- !query output +{"time":2015-10-26 00:00:00} + + +-- !query +select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct> +-- !query output +{"date":2015-10-26} diff --git a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out index 4b879fcfbfc5b..7cacaec42c813 100755 --- a/sql/core/src/test/resources/sql-tests/results/datetime.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/datetime.sql.out @@ -1,5 +1,5 @@ -- Automatically generated by SQLQueryTestSuite --- Number of queries: 91 +-- Number of queries: 116 -- !query @@ -810,3 +810,164 @@ select to_timestamp("2019 10:10:10", "yyyy hh:mm:ss") struct -- !query output 2019-01-01 10:10:10 + + +-- !query +select date_format(date '2020-05-23', 'GGGGG') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(date '2020-05-23', 'MMMMM') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMMMM' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(date '2020-05-23', 'LLLLL') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'LLLLL' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(timestamp '2020-05-23', 'EEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format(timestamp '2020-05-23', 'uuuuu') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'uuuuu' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select date_format('2020-05-23', 'QQQQQ') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Too many pattern letters: Q + + +-- !query +select date_format('2020-05-23', 'qqqqq') +-- !query schema +struct<> +-- !query output +java.lang.IllegalArgumentException +Too many pattern letters: q + + +-- !query +select to_timestamp('2019-10-06 A', 'yyyy-MM-dd GGGGG') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd MM yyyy EEEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select to_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd MM yyyy EEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select unix_timestamp('22 05 2020 Friday', 'dd MM yyyy EEEEE') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd MM yyyy EEEEE' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_unixtime(12345, 'MMMMM') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMMMM' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_unixtime(54321, 'QQQQQ') +-- !query schema +struct +-- !query output +NULL + + +-- !query +select from_unixtime(23456, 'aaaaa') +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'aaaaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html + + +-- !query +select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy')) +-- !query schema +struct<> +-- !query output +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html diff --git a/sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out b/sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out index 3fcd132701a3f..d41d25280146b 100644 --- a/sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/group-by-filter.sql.out @@ -272,7 +272,7 @@ struct= 0)):bigint> -- !query SELECT 'foo', MAX(STRUCT(a)) FILTER (WHERE b >= 1) FROM testData WHERE a = 0 GROUP BY 1 -- !query schema -struct= 1)):struct> +struct= 1)):struct> -- !query output diff --git a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out index 7bfdd0ad53a95..50eb2a9f22f69 100644 --- a/sql/core/src/test/resources/sql-tests/results/group-by.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/group-by.sql.out @@ -87,7 +87,7 @@ struct -- !query SELECT 'foo', MAX(STRUCT(a)) FROM testData WHERE a = 0 GROUP BY 1 -- !query schema -struct> +struct> -- !query output diff --git a/sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out b/sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out index e59b9d5b63a40..7b7aeb4ec7934 100644 --- a/sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out @@ -4654,7 +4654,7 @@ struct -- !query select ln(1.2345678e-28) -- !query schema -struct +struct -- !query output -64.26166165451762 @@ -4662,7 +4662,7 @@ struct -- !query select ln(0.0456789) -- !query schema -struct +struct -- !query output -3.0861187944847437 @@ -4670,7 +4670,7 @@ struct -- !query select ln(0.99949452) -- !query schema -struct +struct -- !query output -5.056077980832118E-4 @@ -4678,7 +4678,7 @@ struct -- !query select ln(1.00049687395) -- !query schema -struct +struct -- !query output 4.967505490136803E-4 @@ -4686,7 +4686,7 @@ struct -- !query select ln(1234.567890123456789) -- !query schema -struct +struct -- !query output 7.11847630129779 @@ -4694,7 +4694,7 @@ struct -- !query select ln(5.80397490724e5) -- !query schema -struct +struct -- !query output 13.271468476626518 @@ -4702,7 +4702,7 @@ struct -- !query select ln(9.342536355e34) -- !query schema -struct +struct -- !query output 80.52247093552418 diff --git a/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out b/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out index 0d37c0d02e61f..20c31b140b009 100644 --- a/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/string-functions.sql.out @@ -55,7 +55,7 @@ struct -- !query select position('bar' in 'foobarbar'), position(null, 'foobarbar'), position('aaads', null) -- !query schema -struct +struct -- !query output 4 NULL NULL 
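Most of the regenerated golden files above boil down to one switch, spark.sql.legacy.timeParserPolicy, which is easier to see side by side than spread across datetime.sql.out and datetime-legacy.sql.out. A minimal spark-sql sketch, assuming a Spark 3.0 session, with the exception text abbreviated from the outputs above:

```sql
-- Default (EXCEPTION) policy: five-letter narrow text styles are rejected.
select date_format(date '2020-05-23', 'MMMMM');
-- org.apache.spark.SparkUpgradeException: You may get a different result due to
-- the upgrading of Spark 3.0: Fail to recognize 'MMMMM' pattern ...

-- LEGACY policy routes through SimpleDateFormat, which treats 4 or more letters
-- as the full text form, so the same pattern succeeds.
set spark.sql.legacy.timeParserPolicy=LEGACY;
select date_format(date '2020-05-23', 'MMMMM');
-- May
```

Note that the quarter patterns 'QQQQQ' and 'qqqqq' fail under both policies; only the message differs, since the new formatter raises "Too many pattern letters" while SimpleDateFormat never supported 'Q' at all and raises "Illegal pattern character".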
diff --git a/sql/core/src/test/resources/sql-tests/results/struct.sql.out b/sql/core/src/test/resources/sql-tests/results/struct.sql.out index f294c5213d319..3b610edc47169 100644 --- a/sql/core/src/test/resources/sql-tests/results/struct.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/struct.sql.out @@ -83,7 +83,7 @@ struct -- !query SELECT ID, STRUCT(ST.C as STC, ST.D as STD).STD FROM tbl_x -- !query schema -struct +struct -- !query output 1 delta 2 eta diff --git a/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/mapZipWith.sql.out b/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/mapZipWith.sql.out index ed7ab5a342c12..d046ff249379f 100644 --- a/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/mapZipWith.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/mapZipWith.sql.out @@ -85,7 +85,7 @@ FROM various_maps struct<> -- !query output org.apache.spark.sql.AnalysisException -cannot resolve 'map_zip_with(various_maps.`decimal_map1`, various_maps.`decimal_map2`, lambdafunction(named_struct(NamePlaceholder(), k, NamePlaceholder(), v1, NamePlaceholder(), v2), k, v1, v2))' due to argument data type mismatch: The input to function map_zip_with should have been two maps with compatible key types, but the key types are [decimal(36,0), decimal(36,35)].; line 1 pos 7 +cannot resolve 'map_zip_with(various_maps.`decimal_map1`, various_maps.`decimal_map2`, lambdafunction(struct(k, v1, v2), k, v1, v2))' due to argument data type mismatch: The input to function map_zip_with should have been two maps with compatible key types, but the key types are [decimal(36,0), decimal(36,35)].; line 1 pos 7 -- !query @@ -113,7 +113,7 @@ FROM various_maps struct<> -- !query output org.apache.spark.sql.AnalysisException -cannot resolve 'map_zip_with(various_maps.`decimal_map2`, various_maps.`int_map`, lambdafunction(named_struct(NamePlaceholder(), k, NamePlaceholder(), v1, NamePlaceholder(), v2), k, v1, v2))' due to argument data type mismatch: The input to function map_zip_with should have been two maps with compatible key types, but the key types are [decimal(36,35), int].; line 1 pos 7 +cannot resolve 'map_zip_with(various_maps.`decimal_map2`, various_maps.`int_map`, lambdafunction(struct(k, v1, v2), k, v1, v2))' due to argument data type mismatch: The input to function map_zip_with should have been two maps with compatible key types, but the key types are [decimal(36,35), int].; line 1 pos 7 -- !query diff --git a/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/stringCastAndExpressions.sql.out b/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/stringCastAndExpressions.sql.out index 8353c7e73d0bb..02944c268ed21 100644 --- a/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/stringCastAndExpressions.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/typeCoercion/native/stringCastAndExpressions.sql.out @@ -136,9 +136,10 @@ NULL -- !query select to_timestamp('2018-01-01', a) from t -- !query schema -struct +struct<> -- !query output -NULL +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 
2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html -- !query @@ -152,9 +153,10 @@ NULL -- !query select to_unix_timestamp('2018-01-01', a) from t -- !query schema -struct +struct<> -- !query output -NULL +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html -- !query @@ -168,9 +170,10 @@ NULL -- !query select unix_timestamp('2018-01-01', a) from t -- !query schema -struct +struct<> -- !query output -NULL +org.apache.spark.SparkUpgradeException +You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html -- !query diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out index 6403406413db9..da5256f5c0453 100644 --- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out @@ -87,7 +87,7 @@ struct> +struct> -- !query output diff --git a/sql/core/src/test/resources/test-data/before_1582_date_v2_4.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_date_v2_4.snappy.parquet deleted file mode 100644 index 7d5cc12eefe04..0000000000000 Binary files a/sql/core/src/test/resources/test-data/before_1582_date_v2_4.snappy.parquet and /dev/null differ diff --git a/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet new file mode 100644 index 0000000000000..edd61c9b9fec8 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_date_v2_4_5.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_date_v2_4_6.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_date_v2_4_6.snappy.parquet new file mode 100644 index 0000000000000..01f4887f5e994 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_date_v2_4_6.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet new file mode 100644 index 0000000000000..c7e8d3926f63a Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_dict_v2_4_5.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet new file mode 100644 index 0000000000000..939e2b8088eb0 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_dict_v2_4_6.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet 
b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet new file mode 100644 index 0000000000000..88a94ac482052 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_plain_v2_4_5.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet new file mode 100644 index 0000000000000..68bfa33aac13f Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_plain_v2_4_6.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_v2_4.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_v2_4.snappy.parquet deleted file mode 100644 index 13254bd93a5e6..0000000000000 Binary files a/sql/core/src/test/resources/test-data/before_1582_timestamp_int96_v2_4.snappy.parquet and /dev/null differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4.snappy.parquet deleted file mode 100644 index 7d2b46e9bea41..0000000000000 Binary files a/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4.snappy.parquet and /dev/null differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4_5.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4_5.snappy.parquet new file mode 100644 index 0000000000000..62e6048354dc1 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4_5.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4_6.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4_6.snappy.parquet new file mode 100644 index 0000000000000..d7fdaa3e67212 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_micros_v2_4_6.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4.snappy.parquet deleted file mode 100644 index e9825455c2015..0000000000000 Binary files a/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4.snappy.parquet and /dev/null differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4_5.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4_5.snappy.parquet new file mode 100644 index 0000000000000..a7cef9e60f134 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4_5.snappy.parquet differ diff --git a/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4_6.snappy.parquet b/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4_6.snappy.parquet new file mode 100644 index 0000000000000..4c213f4540a73 Binary files /dev/null and b/sql/core/src/test/resources/test-data/before_1582_timestamp_millis_v2_4_6.snappy.parquet differ diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala index 14e6ee2b04c14..c12468a4e70f8 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala 
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DateFunctionsSuite.scala @@ -23,7 +23,7 @@ import java.time.{Instant, LocalDateTime, ZoneId} import java.util.{Locale, TimeZone} import java.util.concurrent.TimeUnit -import org.apache.spark.SparkException +import org.apache.spark.{SparkException, SparkUpgradeException} import org.apache.spark.sql.catalyst.util.DateTimeTestUtils.{CEST, LA} import org.apache.spark.sql.catalyst.util.DateTimeUtils import org.apache.spark.sql.functions._ @@ -450,9 +450,9 @@ class DateFunctionsSuite extends QueryTest with SharedSparkSession { checkAnswer( df.select(to_date(col("s"), "yyyy-hh-MM")), Seq(Row(null), Row(null), Row(null))) - checkAnswer( - df.select(to_date(col("s"), "yyyy-dd-aa")), - Seq(Row(null), Row(null), Row(null))) + val e = intercept[SparkUpgradeException](df.select(to_date(col("s"), "yyyy-dd-aa")).collect()) + assert(e.getCause.isInstanceOf[IllegalArgumentException]) + assert(e.getMessage.contains("You may get a different result due to the upgrading of Spark")) // february val x1 = "2016-02-29" @@ -618,8 +618,16 @@ class DateFunctionsSuite extends QueryTest with SharedSparkSession { Row(secs(ts4.getTime)), Row(null), Row(secs(ts3.getTime)), Row(null))) // invalid format - checkAnswer(df1.selectExpr(s"unix_timestamp(x, 'yyyy-MM-dd aa:HH:ss')"), Seq( - Row(null), Row(null), Row(null), Row(null))) + val invalid = df1.selectExpr(s"unix_timestamp(x, 'yyyy-MM-dd aa:HH:ss')") + if (legacyParserPolicy == "legacy") { + checkAnswer(invalid, + Seq(Row(null), Row(null), Row(null), Row(null))) + } else { + val e = intercept[SparkUpgradeException](invalid.collect()) + assert(e.getCause.isInstanceOf[IllegalArgumentException]) + assert( + e.getMessage.contains("You may get a different result due to the upgrading of Spark")) + } // february val y1 = "2016-02-29" diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala index eca39f3f81726..5c35cedba9bab 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala @@ -53,6 +53,7 @@ class QueryExecutionSuite extends SharedSparkSession { s"*(1) Range (0, $expected, step=1, splits=2)", "")) } + test("dumping query execution info to a file") { withTempDir { dir => val path = dir.getCanonicalPath + "/plans.txt" @@ -93,6 +94,25 @@ class QueryExecutionSuite extends SharedSparkSession { assert(exception.getMessage.contains("Illegal character in scheme name")) } + test("dumping query execution info to a file - explainMode=formatted") { + withTempDir { dir => + val path = dir.getCanonicalPath + "/plans.txt" + val df = spark.range(0, 10) + df.queryExecution.debug.toFile(path, explainMode = Option("formatted")) + assert(Source.fromFile(path).getLines.toList + .takeWhile(_ != "== Whole Stage Codegen ==").map(_.replaceAll("#\\d+", "#x")) == List( + "== Physical Plan ==", + s"* Range (1)", + "", + "", + s"(1) Range [codegen id : 1]", + "Output [1]: [id#xL]", + s"Arguments: Range (0, 10, step=1, splits=Some(2))", + "", + "")) + } + } + test("limit number of fields by sql config") { def relationPlans: String = { val ds = spark.createDataset(Seq(QueryExecutionTestRecord( diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala index 10ad8acc68937..e4709e469dca3 100644 --- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala @@ -1203,14 +1203,24 @@ abstract class DDLSuite extends QueryTest with SQLTestUtils { } test("alter table: recover partitions (sequential)") { - withSQLConf(RDD_PARALLEL_LISTING_THRESHOLD.key -> "10") { + val oldRddParallelListingThreshold = spark.sparkContext.conf.get( + RDD_PARALLEL_LISTING_THRESHOLD) + try { + spark.sparkContext.conf.set(RDD_PARALLEL_LISTING_THRESHOLD.key, "10") testRecoverPartitions() + } finally { + spark.sparkContext.conf.set(RDD_PARALLEL_LISTING_THRESHOLD, oldRddParallelListingThreshold) } } test("alter table: recover partition (parallel)") { - withSQLConf(RDD_PARALLEL_LISTING_THRESHOLD.key -> "0") { + val oldRddParallelListingThreshold = spark.sparkContext.conf.get( + RDD_PARALLEL_LISTING_THRESHOLD) + try { + spark.sparkContext.conf.set(RDD_PARALLEL_LISTING_THRESHOLD.key, "0") testRecoverPartitions() + } finally { + spark.sparkContext.conf.set(RDD_PARALLEL_LISTING_THRESHOLD, oldRddParallelListingThreshold) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala index f075d04165697..79c32976f02ec 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala @@ -17,6 +17,7 @@ package org.apache.spark.sql.execution.datasources.parquet +import java.nio.file.{Files, Paths, StandardCopyOption} import java.sql.{Date, Timestamp} import java.time._ import java.util.Locale @@ -45,7 +46,7 @@ import org.apache.spark.{SPARK_VERSION_SHORT, SparkException, SparkUpgradeExcept import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.{InternalRow, ScalaReflection} import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeRow} -import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.catalyst.util.{DateTimeTestUtils, DateTimeUtils} import org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol import org.apache.spark.sql.functions._ import org.apache.spark.sql.internal.SQLConf @@ -875,81 +876,152 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession } } + // It generates input files for the test below: + // "SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps" + ignore("SPARK-31806: generate test files for checking compatibility with Spark 2.4") { + val resourceDir = "sql/core/src/test/resources/test-data" + val version = "2_4_5" + val N = 8 + def save( + in: Seq[(String, String)], + t: String, + dstFile: String, + options: Map[String, String] = Map.empty): Unit = { + withTempDir { dir => + in.toDF("dict", "plain") + .select($"dict".cast(t), $"plain".cast(t)) + .repartition(1) + .write + .mode("overwrite") + .options(options) + .parquet(dir.getCanonicalPath) + Files.copy( + dir.listFiles().filter(_.getName.endsWith(".snappy.parquet")).head.toPath, + Paths.get(resourceDir, dstFile), + StandardCopyOption.REPLACE_EXISTING) + } + } + DateTimeTestUtils.withDefaultTimeZone(DateTimeTestUtils.LA) { + withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> DateTimeTestUtils.LA.getId) { + save( + (1 to N).map(i => ("1001-01-01", s"1001-01-0$i")), + "date", + s"before_1582_date_v$version.snappy.parquet") + 
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MILLIS") { + save( + (1 to N).map(i => ("1001-01-01 01:02:03.123", s"1001-01-0$i 01:02:03.123")), + "timestamp", + s"before_1582_timestamp_millis_v$version.snappy.parquet") + } + val usTs = (1 to N).map(i => ("1001-01-01 01:02:03.123456", s"1001-01-0$i 01:02:03.123456")) + withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MICROS") { + save(usTs, "timestamp", s"before_1582_timestamp_micros_v$version.snappy.parquet") + } + withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "INT96") { + // Compared to other logical types, Parquet-MR chooses dictionary encoding for the + // INT96 logical type because it consumes less memory for small column cardinality. + // It doesn't make sense to place huge Parquet files in the resource folder. That's why + // we explicitly set `parquet.enable.dictionary` and generate two files w/ and w/o + // dictionary encoding. + save( + usTs, + "timestamp", + s"before_1582_timestamp_int96_plain_v$version.snappy.parquet", + Map("parquet.enable.dictionary" -> "false")) + save( + usTs, + "timestamp", + s"before_1582_timestamp_int96_dict_v$version.snappy.parquet", + Map("parquet.enable.dictionary" -> "true")) + } + } + } + } + test("SPARK-31159: compatibility with Spark 2.4 in reading dates/timestamps") { + val N = 8 // test reading the existing 2.4 files and new 3.0 files (with rebase on/off) together. - def checkReadMixedFiles(fileName: String, dt: String, dataStr: String): Unit = { + def checkReadMixedFiles[T]( + fileName: String, + catalystType: String, + rowFunc: Int => (String, String), + toJavaType: String => T, + checkDefaultLegacyRead: String => Unit, + tsOutputType: String = "TIMESTAMP_MICROS"): Unit = { withTempPaths(2) { paths => paths.foreach(_.delete()) val path2_4 = getResourceParquetFilePath("test-data/" + fileName) val path3_0 = paths(0).getCanonicalPath val path3_0_rebase = paths(1).getCanonicalPath - if (dt == "date") { - val df = Seq(dataStr).toDF("str").select($"str".cast("date").as("date")) - + val df = Seq.tabulate(N)(rowFunc).toDF("dict", "plain") + .select($"dict".cast(catalystType), $"plain".cast(catalystType)) + withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> tsOutputType) { + checkDefaultLegacyRead(path2_4) // By default we should fail to write ancient datetime values. - var e = intercept[SparkException](df.write.parquet(path3_0)) + val e = intercept[SparkException](df.write.parquet(path3_0)) assert(e.getCause.getCause.getCause.isInstanceOf[SparkUpgradeException]) - // By default we should fail to read ancient datetime values. - e = intercept[SparkException](spark.read.parquet(path2_4).collect()) - assert(e.getCause.isInstanceOf[SparkUpgradeException]) - withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> CORRECTED.toString) { df.write.mode("overwrite").parquet(path3_0) } withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> LEGACY.toString) { df.write.parquet(path3_0_rebase) } - - // For Parquet files written by Spark 3.0, we know the writer info and don't need the - // config to guide the rebase behavior.
- withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) { - checkAnswer( - spark.read.format("parquet").load(path2_4, path3_0, path3_0_rebase), - 1.to(3).map(_ => Row(java.sql.Date.valueOf(dataStr)))) - } - } else { - val df = Seq(dataStr).toDF("str").select($"str".cast("timestamp").as("ts")) - withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> dt) { - // By default we should fail to write ancient datetime values. - var e = intercept[SparkException](df.write.parquet(path3_0)) - assert(e.getCause.getCause.getCause.isInstanceOf[SparkUpgradeException]) - // By default we should fail to read ancient datetime values. - e = intercept[SparkException](spark.read.parquet(path2_4).collect()) - assert(e.getCause.isInstanceOf[SparkUpgradeException]) - - withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> CORRECTED.toString) { - df.write.mode("overwrite").parquet(path3_0) - } - withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE.key -> LEGACY.toString) { - df.write.parquet(path3_0_rebase) - } - } - // For Parquet files written by Spark 3.0, we know the writer info and don't need the - // config to guide the rebase behavior. - withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) { - checkAnswer( - spark.read.format("parquet").load(path2_4, path3_0, path3_0_rebase), - 1.to(3).map(_ => Row(java.sql.Timestamp.valueOf(dataStr)))) - } + } + // For Parquet files written by Spark 3.0, we know the writer info and don't need the + // config to guide the rebase behavior. + withSQLConf(SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_READ.key -> LEGACY.toString) { + checkAnswer( + spark.read.format("parquet").load(path2_4, path3_0, path3_0_rebase), + (0 until N).flatMap { i => + val (dictS, plainS) = rowFunc(i) + Seq.tabulate(3) { _ => + Row(toJavaType(dictS), toJavaType(plainS)) + } + }) + } } } - - withAllParquetReaders { - checkReadMixedFiles("before_1582_date_v2_4.snappy.parquet", "date", "1001-01-01") - checkReadMixedFiles( - "before_1582_timestamp_micros_v2_4.snappy.parquet", - "TIMESTAMP_MICROS", - "1001-01-01 01:02:03.123456") - checkReadMixedFiles( - "before_1582_timestamp_millis_v2_4.snappy.parquet", - "TIMESTAMP_MILLIS", - "1001-01-01 01:02:03.123") - - // INT96 is a legacy timestamp format and we always rebase the seconds for it. - checkAnswer(readResourceParquetFile( - "test-data/before_1582_timestamp_int96_v2_4.snappy.parquet"), - Row(java.sql.Timestamp.valueOf("1001-01-01 01:02:03.123456"))) + def failInRead(path: String): Unit = { + val e = intercept[SparkException](spark.read.parquet(path).collect()) + assert(e.getCause.isInstanceOf[SparkUpgradeException]) + } + def successInRead(path: String): Unit = spark.read.parquet(path).collect() + Seq( + // By default we should fail to read ancient datetime values when Parquet files don't + // contain the Spark version.
+ "2_4_5" -> failInRead _, + "2_4_6" -> successInRead _).foreach { case (version, checkDefaultRead) => + withAllParquetReaders { + checkReadMixedFiles( + s"before_1582_date_v$version.snappy.parquet", + "date", + (i: Int) => ("1001-01-01", s"1001-01-0${i + 1}"), + java.sql.Date.valueOf, + checkDefaultRead) + checkReadMixedFiles( + s"before_1582_timestamp_micros_v$version.snappy.parquet", + "timestamp", + (i: Int) => ("1001-01-01 01:02:03.123456", s"1001-01-0${i + 1} 01:02:03.123456"), + java.sql.Timestamp.valueOf, + checkDefaultRead) + checkReadMixedFiles( + s"before_1582_timestamp_millis_v$version.snappy.parquet", + "timestamp", + (i: Int) => ("1001-01-01 01:02:03.123", s"1001-01-0${i + 1} 01:02:03.123"), + java.sql.Timestamp.valueOf, + checkDefaultRead, + tsOutputType = "TIMESTAMP_MILLIS") + // INT96 is a legacy timestamp format and we always rebase the seconds for it. + Seq("plain", "dict").foreach { enc => + checkAnswer(readResourceParquetFile( + s"test-data/before_1582_timestamp_int96_${enc}_v$version.snappy.parquet"), + Seq.tabulate(N) { i => + Row( + java.sql.Timestamp.valueOf("1001-01-01 01:02:03.123456"), + java.sql.Timestamp.valueOf(s"1001-01-0${i + 1} 01:02:03.123456")) + }) + } + } } } diff --git a/sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala b/sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala index a01d5a44da714..b68563956c82c 100644 --- a/sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala +++ b/sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.orc -import java.time.LocalDate +import java.time.{Instant, LocalDate} import org.apache.orc.storage.common.`type`.HiveDecimal import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument} @@ -26,7 +26,7 @@ import org.apache.orc.storage.ql.io.sarg.SearchArgumentFactory.newBuilder import org.apache.orc.storage.serde2.io.HiveDecimalWritable import org.apache.spark.SparkException -import org.apache.spark.sql.catalyst.util.DateTimeUtils.{localDateToDays, toJavaDate} +import org.apache.spark.sql.catalyst.util.DateTimeUtils.{instantToMicros, localDateToDays, toJavaDate, toJavaTimestamp} import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quoteIfNeeded import org.apache.spark.sql.sources.Filter import org.apache.spark.sql.types._ @@ -167,6 +167,8 @@ private[sql] object OrcFilters extends OrcFiltersBase { new HiveDecimalWritable(HiveDecimal.create(value.asInstanceOf[java.math.BigDecimal])) case _: DateType if value.isInstanceOf[LocalDate] => toJavaDate(localDateToDays(value.asInstanceOf[LocalDate])) + case _: TimestampType if value.isInstanceOf[Instant] => + toJavaTimestamp(instantToMicros(value.asInstanceOf[Instant])) case _ => value } diff --git a/sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala b/sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala index a1c325e7bb876..88b4b243b543a 100644 --- a/sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala +++ b/sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala @@ -245,29 +245,41 @@ class OrcFilterSuite extends OrcTest with SharedSparkSession { } test("filter pushdown - timestamp") { - val timeString = "2015-08-20 14:57:00" - val timestamps = (1 to 4).map { 
i => - val milliseconds = Timestamp.valueOf(timeString).getTime + i * 3600 - new Timestamp(milliseconds) - } - withOrcDataFrame(timestamps.map(Tuple1(_))) { implicit df => - checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL) + val input = Seq( + "1000-01-01 01:02:03", + "1582-10-01 00:11:22", + "1900-01-01 23:59:59", + "2020-05-25 10:11:12").map(Timestamp.valueOf) - checkFilterPredicate($"_1" === timestamps(0), PredicateLeaf.Operator.EQUALS) - checkFilterPredicate($"_1" <=> timestamps(0), PredicateLeaf.Operator.NULL_SAFE_EQUALS) - - checkFilterPredicate($"_1" < timestamps(1), PredicateLeaf.Operator.LESS_THAN) - checkFilterPredicate($"_1" > timestamps(2), PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate($"_1" <= timestamps(0), PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate($"_1" >= timestamps(3), PredicateLeaf.Operator.LESS_THAN) - - checkFilterPredicate(Literal(timestamps(0)) === $"_1", PredicateLeaf.Operator.EQUALS) - checkFilterPredicate(Literal(timestamps(0)) <=> $"_1", - PredicateLeaf.Operator.NULL_SAFE_EQUALS) - checkFilterPredicate(Literal(timestamps(1)) > $"_1", PredicateLeaf.Operator.LESS_THAN) - checkFilterPredicate(Literal(timestamps(2)) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate(Literal(timestamps(0)) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate(Literal(timestamps(3)) <= $"_1", PredicateLeaf.Operator.LESS_THAN) + withOrcFile(input.map(Tuple1(_))) { path => + Seq(false, true).foreach { java8Api => + withSQLConf(SQLConf.DATETIME_JAVA8API_ENABLED.key -> java8Api.toString) { + readFile(path) { implicit df => + val timestamps = input.map(Literal(_)) + checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL) + + checkFilterPredicate($"_1" === timestamps(0), PredicateLeaf.Operator.EQUALS) + checkFilterPredicate($"_1" <=> timestamps(0), PredicateLeaf.Operator.NULL_SAFE_EQUALS) + + checkFilterPredicate($"_1" < timestamps(1), PredicateLeaf.Operator.LESS_THAN) + checkFilterPredicate($"_1" > timestamps(2), PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate($"_1" <= timestamps(0), PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate($"_1" >= timestamps(3), PredicateLeaf.Operator.LESS_THAN) + + checkFilterPredicate(Literal(timestamps(0)) === $"_1", PredicateLeaf.Operator.EQUALS) + checkFilterPredicate( + Literal(timestamps(0)) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS) + checkFilterPredicate(Literal(timestamps(1)) > $"_1", PredicateLeaf.Operator.LESS_THAN) + checkFilterPredicate( + Literal(timestamps(2)) < $"_1", + PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate( + Literal(timestamps(0)) >= $"_1", + PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate(Literal(timestamps(3)) <= $"_1", PredicateLeaf.Operator.LESS_THAN) + } + } + } } } diff --git a/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala b/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala index 445a52cece1c3..4b642080d25ad 100644 --- a/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala +++ b/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.orc -import java.time.LocalDate +import java.time.{Instant, LocalDate} import org.apache.hadoop.hive.common.`type`.HiveDecimal import org.apache.hadoop.hive.ql.io.sarg.{PredicateLeaf, 
SearchArgument} @@ -26,7 +26,7 @@ import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory.newBuilder import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable import org.apache.spark.SparkException -import org.apache.spark.sql.catalyst.util.DateTimeUtils.{localDateToDays, toJavaDate} +import org.apache.spark.sql.catalyst.util.DateTimeUtils.{instantToMicros, localDateToDays, toJavaDate, toJavaTimestamp} import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quoteIfNeeded import org.apache.spark.sql.sources.Filter import org.apache.spark.sql.types._ @@ -167,6 +167,8 @@ private[sql] object OrcFilters extends OrcFiltersBase { new HiveDecimalWritable(HiveDecimal.create(value.asInstanceOf[java.math.BigDecimal])) case _: DateType if value.isInstanceOf[LocalDate] => toJavaDate(localDateToDays(value.asInstanceOf[LocalDate])) + case _: TimestampType if value.isInstanceOf[Instant] => + toJavaTimestamp(instantToMicros(value.asInstanceOf[Instant])) case _ => value } diff --git a/sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala b/sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala index 815af05beb002..2263179515a5f 100644 --- a/sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala +++ b/sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala @@ -246,29 +246,41 @@ class OrcFilterSuite extends OrcTest with SharedSparkSession { } test("filter pushdown - timestamp") { - val timeString = "2015-08-20 14:57:00" - val timestamps = (1 to 4).map { i => - val milliseconds = Timestamp.valueOf(timeString).getTime + i * 3600 - new Timestamp(milliseconds) - } - withOrcDataFrame(timestamps.map(Tuple1(_))) { implicit df => - checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL) - - checkFilterPredicate($"_1" === timestamps(0), PredicateLeaf.Operator.EQUALS) - checkFilterPredicate($"_1" <=> timestamps(0), PredicateLeaf.Operator.NULL_SAFE_EQUALS) + val input = Seq( + "1000-01-01 01:02:03", + "1582-10-01 00:11:22", + "1900-01-01 23:59:59", + "2020-05-25 10:11:12").map(Timestamp.valueOf) - checkFilterPredicate($"_1" < timestamps(1), PredicateLeaf.Operator.LESS_THAN) - checkFilterPredicate($"_1" > timestamps(2), PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate($"_1" <= timestamps(0), PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate($"_1" >= timestamps(3), PredicateLeaf.Operator.LESS_THAN) + withOrcFile(input.map(Tuple1(_))) { path => + Seq(false, true).foreach { java8Api => + withSQLConf(SQLConf.DATETIME_JAVA8API_ENABLED.key -> java8Api.toString) { + readFile(path) { implicit df => + val timestamps = input.map(Literal(_)) + checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL) - checkFilterPredicate(Literal(timestamps(0)) === $"_1", PredicateLeaf.Operator.EQUALS) - checkFilterPredicate( - Literal(timestamps(0)) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS) - checkFilterPredicate(Literal(timestamps(1)) > $"_1", PredicateLeaf.Operator.LESS_THAN) - checkFilterPredicate(Literal(timestamps(2)) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate(Literal(timestamps(0)) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS) - checkFilterPredicate(Literal(timestamps(3)) <= $"_1", PredicateLeaf.Operator.LESS_THAN) + checkFilterPredicate($"_1" === timestamps(0), PredicateLeaf.Operator.EQUALS) + checkFilterPredicate($"_1" <=> timestamps(0), 
PredicateLeaf.Operator.NULL_SAFE_EQUALS) + + checkFilterPredicate($"_1" < timestamps(1), PredicateLeaf.Operator.LESS_THAN) + checkFilterPredicate($"_1" > timestamps(2), PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate($"_1" <= timestamps(0), PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate($"_1" >= timestamps(3), PredicateLeaf.Operator.LESS_THAN) + + checkFilterPredicate(Literal(timestamps(0)) === $"_1", PredicateLeaf.Operator.EQUALS) + checkFilterPredicate( + Literal(timestamps(0)) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS) + checkFilterPredicate(Literal(timestamps(1)) > $"_1", PredicateLeaf.Operator.LESS_THAN) + checkFilterPredicate( + Literal(timestamps(2)) < $"_1", + PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate( + Literal(timestamps(0)) >= $"_1", + PredicateLeaf.Operator.LESS_THAN_EQUALS) + checkFilterPredicate(Literal(timestamps(3)) <= $"_1", PredicateLeaf.Operator.LESS_THAN) + } + } + } } }
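The `DateFunctionsSuite` changes above hinge on the new parser policy behavior: patterns such as `'yyyy-dd-aa'` that Spark 2.4 silently resolved to `null` now fail with `SparkUpgradeException` under the default `EXCEPTION` policy, and the suite branches on `legacyParserPolicy` accordingly. A minimal sketch of how a user opts back into the 2.4 semantics via `spark.sql.legacy.timeParserPolicy`; the session setup, object name, and sample data are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object TimeParserPolicySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local session for illustration
      .appName("time-parser-policy")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("2015-07-22", "2014-12-31").toDF("s")

    // Under the default EXCEPTION policy, collecting this query fails with
    // SparkUpgradeException because 'aa' is invalid for the new
    // DateTimeFormatter-based parser:
    // df.selectExpr("to_date(s, 'yyyy-dd-aa')").collect()

    // LEGACY falls back to SimpleDateFormat, so unparsable input yields null,
    // matching the rows the old assertion expected.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
    df.selectExpr("to_date(s, 'yyyy-dd-aa')").show()

    spark.stop()
  }
}
```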
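The new `QueryExecutionSuite` test drives `debug.toFile` with `explainMode = Option("formatted")`. The same formatted plan layout is reachable through the public `Dataset.explain(mode)` API; a short sketch, with the session setup assumed:

```scala
import org.apache.spark.sql.SparkSession

object FormattedExplainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local session for illustration
      .appName("formatted-explain")
      .getOrCreate()

    val df = spark.range(0, 10)

    // Prints the physical plan as a numbered operator list followed by
    // per-operator details, the same layout the test asserts on:
    //   == Physical Plan ==
    //   * Range (1)
    //   ...
    //   (1) Range [codegen id : 1]
    df.explain("formatted")

    spark.stop()
  }
}
```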
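The generator test pins Parquet dictionary encoding on and off through the `parquet.enable.dictionary` option, which `DataFrameWriter.options` forwards to the underlying Hadoop configuration. A hedged sketch of the same knob on the user-facing writer; the paths, data, and object name are assumptions:

```scala
import org.apache.spark.sql.SparkSession

object ParquetDictionarySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local session for illustration
      .appName("parquet-dictionary")
      .getOrCreate()
    import spark.implicits._

    // Low-cardinality column: with dictionary encoding, Parquet-MR stores the
    // distinct values once and encodes rows as small dictionary indices.
    val df = Seq.fill(1000)("1001-01-01 01:02:03.123456").toDF("ts")

    // Force plain encoding, as the generator does for the *_plain_* file.
    df.write
      .option("parquet.enable.dictionary", "false")
      .mode("overwrite")
      .parquet("/tmp/int96_plain")   // illustrative path

    // Keep dictionary encoding, as for the *_dict_* file.
    df.write
      .option("parquet.enable.dictionary", "true")
      .mode("overwrite")
      .parquet("/tmp/int96_dict")    // illustrative path

    spark.stop()
  }
}
```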
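The `checkReadMixedFiles` refactoring exercises the datetime rebase configs (`SQLConf.LEGACY_PARQUET_REBASE_MODE_IN_WRITE` / `..._IN_READ`). A sketch of the user-visible effect for pre-1582 dates, assuming the Spark 3.0 config keys and an illustrative output path:

```scala
import org.apache.spark.sql.SparkSession

object RebaseModeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local session for illustration
      .appName("parquet-rebase")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("1001-01-01").toDF("s").selectExpr("CAST(s AS DATE) AS d")
    val path = "/tmp/before_1582_date"  // illustrative path

    // Default mode (EXCEPTION): writing pre-1582 values fails with a
    // SparkUpgradeException wrapped in the task failure, as the test asserts.
    // LEGACY rebases values from the Proleptic Gregorian calendar to the
    // hybrid Julian+Gregorian calendar Spark 2.4 used.
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
    df.write.mode("overwrite").parquet(path)

    // Files written by Spark 3.0 record the writer version, so the read side
    // does not consult spark.sql.legacy.parquet.datetimeRebaseModeInRead for
    // them; the read-side config only guides files of unknown provenance.
    spark.read.parquet(path).show()

    spark.stop()
  }
}
```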
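Finally, the `OrcFilters` change in both ORC shims converts `java.time.Instant` filter values to `java.sql.Timestamp`, since the ORC `SearchArgument` builder only accepts the latter; the rewritten filter-pushdown tests cover both settings of `spark.sql.datetime.java8API.enabled`. A sketch of the end-to-end path this enables; the path and object name are assumptions:

```scala
import java.time.Instant

import org.apache.spark.sql.SparkSession

object OrcInstantPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")            // assumption: local session for illustration
      .appName("orc-instant-pushdown")
      // With the Java 8 API enabled, timestamp filter values reach the ORC
      // source as java.time.Instant instead of java.sql.Timestamp.
      .config("spark.sql.datetime.java8API.enabled", "true")
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/orc_ts"         // illustrative path
    Seq("1900-01-01 23:59:59", "2020-05-25 10:11:12")
      .map(java.sql.Timestamp.valueOf)
      .toDF("ts")
      .write.mode("overwrite").orc(path)

    // The Instant literal is converted (instantToMicros followed by
    // toJavaTimestamp) before being handed to the SearchArgument builder,
    // so this predicate can be pushed down instead of erroring.
    spark.read.orc(path)
      .filter($"ts" > Instant.parse("2000-01-01T00:00:00Z"))
      .show()

    spark.stop()
  }
}
```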