@@ -26,6 +26,7 @@ import com.univocity.parsers.csv.{CsvParserSettings, CsvWriterSettings, Unescape
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.internal.SQLConf
+ import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy

class CSVOptions(
@transient val parameters: CaseInsensitiveMap[String],
@@ -148,8 +149,12 @@ class CSVOptions(

val dateFormat: String = parameters.getOrElse("dateFormat", DateFormatter.defaultPattern)

- val timestampFormat: String =
-   parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")
+ val timestampFormat: String = parameters.getOrElse("timestampFormat",
+   if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
+     s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"
+   } else {
+     s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
+   })

val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
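
The effect of the new default pattern can be illustrated with a small sketch. It uses plain `java.time.format.DateTimeFormatter` rather than Spark's own formatter classes, so it only shows how the optional sections `[.SSS]` and `[XXX]` behave and is not the code path Spark takes. The object name `TimestampPatternSketch` and the helper `accepts` are hypothetical, and the sketch assumes that `DateFormatter.defaultPattern` expands to `yyyy-MM-dd`.

```scala
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampPatternSketch {
  // Assumption: DateFormatter.defaultPattern expands to "yyyy-MM-dd".
  val datePattern = "yyyy-MM-dd"

  // Default kept for the LEGACY policy: fraction and zone offset are mandatory.
  val legacyDefault: DateTimeFormatter =
    DateTimeFormatter.ofPattern(s"${datePattern}'T'HH:mm:ss.SSSXXX")

  // New default: square brackets mark the fraction and the offset as optional.
  val newDefault: DateTimeFormatter =
    DateTimeFormatter.ofPattern(s"${datePattern}'T'HH:mm:ss[.SSS][XXX]")

  // True when the formatter parses the whole input without raising an error.
  def accepts(formatter: DateTimeFormatter, text: String): Boolean =
    Try(formatter.parse(text)).isSuccess
}
```

With these definitions, `accepts(newDefault, "2020-03-01T12:34:56")` succeeds, while `accepts(legacyDefault, "2020-03-01T12:34:56")` fails because the mandatory `.SSS` and `XXX` parts are missing from the input.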

@@ -27,6 +27,7 @@ import com.fasterxml.jackson.core.json.JsonReadFeature
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.internal.SQLConf
+ import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy

/**
* Options for parsing JSON data into Spark SQL rows.
@@ -90,8 +91,12 @@ private[sql] class JSONOptions(

val dateFormat: String = parameters.getOrElse("dateFormat", DateFormatter.defaultPattern)

- val timestampFormat: String =
-   parameters.getOrElse("timestampFormat", s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX")
+ val timestampFormat: String = parameters.getOrElse("timestampFormat",
+   if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
+     s"${DateFormatter.defaultPattern}'T'HH:mm:ss.SSSXXX"
+   } else {
+     s"${DateFormatter.defaultPattern}'T'HH:mm:ss[.SSS][XXX]"
+   })

val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

88 changes: 44 additions & 44 deletions sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
@@ -2,66 +2,66 @@
Benchmark to measure CSV read/write performance
================================================================================================

- OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
- Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+ Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
+ Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------------------------------
- One quoted string                                 44297          44515         373          0.0      885948.7       1.0X
+ One quoted string                                 24907          29374         NaN          0.0      498130.5       1.0X

- OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
- Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+ Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
+ Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------------------------------
- Select 1000 columns                              196720         197783        1560          0.0      196719.8       1.0X
- Select 100 columns                                46691          46861         219          0.0       46691.4       4.2X
- Select one column                                 36811          36922         111          0.0       36811.3       5.3X
- count()                                            8520           8610         106          0.1        8520.5      23.1X
- Select 100 columns, one bad input field           67914          67994         136          0.0       67914.0       2.9X
- Select 100 columns, corrupt record field          77272          77445         214          0.0       77272.0       2.5X
+ Select 1000 columns                               62811          63690        1416          0.0       62811.4       1.0X
+ Select 100 columns                                23839          24064         230          0.0       23839.5       2.6X
+ Select one column                                 19936          20641         827          0.1       19936.4       3.2X
+ count()                                            4174           4380         206          0.2        4174.4      15.0X
+ Select 100 columns, one bad input field           41015          42380        1688          0.0       41015.4       1.5X
+ Select 100 columns, corrupt record field          46281          46338          93          0.0       46280.5       1.4X

- OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
- Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+ Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
+ Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------------------------------
- Select 10 columns + count()                       25965          26054         103          0.4        2596.5       1.0X
- Select 1 column + count()                         18591          18666          91          0.5        1859.1       1.4X
- count()                                            6102           6119          18          1.6         610.2       4.3X
+ Select 10 columns + count()                       10810          10997         163          0.9        1081.0       1.0X
+ Select 1 column + count()                          7608           7641          47          1.3         760.8       1.4X
+ count()                                            2415           2462          77          4.1         241.5       4.5X

- OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
- Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+ Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
+ Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------------------------------
- Create a dataset of timestamps                     2142           2161          17          4.7         214.2       1.0X
- to_csv(timestamp)                                 14744          14950         182          0.7        1474.4       0.1X
- write timestamps to files                         12078          12202         175          0.8        1207.8       0.2X
- Create a dataset of dates                          2275           2291          18          4.4         227.5       0.9X
- to_csv(date)                                      11407          11464          51          0.9        1140.7       0.2X
- write dates to files                               7638           7702          90          1.3         763.8       0.3X
+ Create a dataset of timestamps                      874            914          37         11.4          87.4       1.0X
+ to_csv(timestamp)                                  7051           7223         250          1.4         705.1       0.1X
+ write timestamps to files                          6712           6741          31          1.5         671.2       0.1X
+ Create a dataset of dates                           909            945          35         11.0          90.9       1.0X
+ to_csv(date)                                       4222           4231           8          2.4         422.2       0.2X
+ write dates to files                               3799           3813          14          2.6         379.9       0.2X

- OpenJDK 64-Bit Server VM 11.0.5+10-post-Ubuntu-0ubuntu1.118.04 on Linux 4.15.0-1044-aws
- Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+ Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
+ Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------------------------------
- read timestamp text from files                     2578           2590          10          3.9         257.8       1.0X
- read timestamps from files                        60103          60694         512          0.2        6010.3       0.0X
- infer timestamps from files                      107871         108268         351          0.1       10787.1       0.0X
- read date text from files                          2306           2310           4          4.3         230.6       1.1X
- read date from files                              47415          47657         367          0.2        4741.5       0.1X
- infer date from files                             35261          35447         164          0.3        3526.1       0.1X
- timestamp strings                                  3045           3056          11          3.3         304.5       0.8X
- parse timestamps from Dataset[String]             62221          63173         849          0.2        6222.1       0.0X
- infer timestamps from Dataset[String]            118838         119629         697          0.1       11883.8       0.0X
- date strings                                       3459           3481          19          2.9         345.9       0.7X
- parse dates from Dataset[String]                  51026          51447         503          0.2        5102.6       0.1X
- from_csv(timestamp)                               60738          61818         936          0.2        6073.8       0.0X
- from_csv(date)                                    46012          46278         370          0.2        4601.2       0.1X
+ read timestamp text from files                     1342           1364          35          7.5         134.2       1.0X
+ read timestamps from files                        20300          20473         247          0.5        2030.0       0.1X
+ infer timestamps from files                       40705          40744          54          0.2        4070.5       0.0X
+ read date text from files                          1146           1151           6          8.7         114.6       1.2X
+ read date from files                              12278          12408         117          0.8        1227.8       0.1X
+ infer date from files                             12734          12872         220          0.8        1273.4       0.1X
+ timestamp strings                                  1467           1482          15          6.8         146.7       0.9X
+ parse timestamps from Dataset[String]             21708          22234         477          0.5        2170.8       0.1X
+ infer timestamps from Dataset[String]             42357          43253         922          0.2        4235.7       0.0X
+ date strings                                       1512           1532          18          6.6         151.2       0.9X
+ parse dates from Dataset[String]                  13436          13470          33          0.7        1343.6       0.1X
+ from_csv(timestamp)                               20390          20486          95          0.5        2039.0       0.1X
+ from_csv(date)                                    12592          12693         139          0.8        1259.2       0.1X

- OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2
- Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
+ Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
+ Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
  ------------------------------------------------------------------------------------------------------------------------
- w/o filters                                       11889          11945          52          0.0      118893.1       1.0X
- pushdown disabled                                 11790          11860         115          0.0      117902.3       1.0X
- w/ filters                                         1240           1278          33          0.1       12400.8       9.6X
+ w/o filters                                       12535          12606          67          0.0      125348.8       1.0X
+ pushdown disabled                                 12611          12672          91          0.0      126112.9       1.0X
+ w/ filters                                         1093           1099          11          0.1       10928.3      11.5X

