New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31353][SQL] Set a time zone in DateTimeBenchmark and DateTimeRebaseBenchmark #28127
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
MaxGekk
commented
Apr 5, 2020
after 1582, rebase on 20016 20016 0 5.0 200.2 0.5X | ||
before 1582, rebase off 20088 20088 0 5.0 200.9 0.5X | ||
before 1582, rebase on 20310 20310 0 4.9 203.1 0.5X | ||
after 1582, noop 18597 18597 0 5.4 186.0 1.0X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan @HyukjinKwon Overhead of preparing input data depends on the system time zone. I opened this PR to have a base line for the optimisation #28119
Test build #120838 has finished for PR 28127 at commit
|
thanks, merging to master/3.0! |
cloud-fan
pushed a commit
that referenced
this pull request
Apr 6, 2020
…ebaseBenchmark ### What changes were proposed in this pull request? In the PR, I propose to set the `America/Los_Angeles` time zone in the date-time benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` via `withDefaultTimeZone(LA)` and `withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> LA.getId)`. The results of affected benchmarks was given on an Amazon EC2 instance w/ the configuration: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK8/11 | ### Why are the changes needed? Performance of date-time functions can depend on the system JVM time zone or SQL config `spark.sql.session.timeZone`. The changes allow to avoid any fluctuations of benchmarks results related to time zones, and set a reliable baseline for future optimization. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By regenerating results of DateTimeBenchmark and DateTimeRebaseBenchmark. Closes #28127 from MaxGekk/set-timezone-in-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 35e6a9d) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan
pushed a commit
that referenced
this pull request
Apr 9, 2020
### What changes were proposed in this pull request? In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: | i | local timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes| | -- | ------- |----|----| ---- | |0|0001-01-01 00:00|-62135568422|-62135740800|-2872| |1|0100-03-01 00:00|-59006333222|-59006419200|-1432| |...|...|...|...|...| |13|1582-10-15 00:00|-12219264422|-12219264000|7| |14|1883-11-18 12:00|-2717640000|-2717640000|0| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed to hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to do not pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp to separate methods that require zone id as the first parameter assuming that the caller has zone id already. This allows to void unnecessary retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as reference implementation. - Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If hash maps don't contain rebasing info for the given time zone id, the methods falls back to the implementation via local timestamps. This allows to support time zones specified as zone offsets like '-08:00'. ### Why are the changes needed? To make timestamps rebasing faster: - Saving timestamps to parquet files is ~ **x3.8 faster** - Loading timestamps from parquet files is ~**x2.8 faster**. - Loading timestamps by Vectorized reader ~**x4.6 faster**. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - swith points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. - The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127): | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | Closes #28119 from MaxGekk/optimize-rebase-micros. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
MaxGekk
added a commit
to MaxGekk/spark
that referenced
this pull request
Apr 9, 2020
In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: | i | local timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes| | -- | ------- |----|----| ---- | |0|0001-01-01 00:00|-62135568422|-62135740800|-2872| |1|0100-03-01 00:00|-59006333222|-59006419200|-1432| |...|...|...|...|...| |13|1582-10-15 00:00|-12219264422|-12219264000|7| |14|1883-11-18 12:00|-2717640000|-2717640000|0| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed to hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to do not pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp to separate methods that require zone id as the first parameter assuming that the caller has zone id already. This allows to void unnecessary retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as reference implementation. - Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If hash maps don't contain rebasing info for the given time zone id, the methods falls back to the implementation via local timestamps. This allows to support time zones specified as zone offsets like '-08:00'. To make timestamps rebasing faster: - Saving timestamps to parquet files is ~ **x3.8 faster** - Loading timestamps from parquet files is ~**x2.8 faster**. - Loading timestamps by Vectorized reader ~**x4.6 faster**. No - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - swith points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. - The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR apache#28127): | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | Closes apache#28119 from MaxGekk/optimize-rebase-micros. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit e2d9399) Signed-off-by: Max Gekk <max.gekk@gmail.com>
cloud-fan
pushed a commit
that referenced
this pull request
Apr 10, 2020
### What changes were proposed in this pull request? In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: | i | local timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes| | -- | ------- |----|----| ---- | |0|0001-01-01 00:00|-62135568422|-62135740800|-2872| |1|0100-03-01 00:00|-59006333222|-59006419200|-1432| |...|...|...|...|...| |13|1582-10-15 00:00|-12219264422|-12219264000|7| |14|1883-11-18 12:00|-2717640000|-2717640000|0| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed to hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to do not pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp to separate methods that require zone id as the first parameter assuming that the caller has zone id already. This allows to void unnecessary retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as reference implementation. - Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If hash maps don't contain rebasing info for the given time zone id, the methods falls back to the implementation via local timestamps. This allows to support time zones specified as zone offsets like '-08:00'. ### Why are the changes needed? To make timestamps rebasing faster: - Saving timestamps to parquet files is ~ **x3.8 faster** - Loading timestamps from parquet files is ~**x2.8 faster**. - Loading timestamps by Vectorized reader ~**x4.6 faster**. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - swith points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. - The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127): | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | Closes #28163 from MaxGekk/optimize-rebase-micros-3.0. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho
pushed a commit
to sjincho/spark
that referenced
this pull request
Apr 15, 2020
…ebaseBenchmark ### What changes were proposed in this pull request? In the PR, I propose to set the `America/Los_Angeles` time zone in the date-time benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` via `withDefaultTimeZone(LA)` and `withSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> LA.getId)`. The results of affected benchmarks was given on an Amazon EC2 instance w/ the configuration: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK8/11 | ### Why are the changes needed? Performance of date-time functions can depend on the system JVM time zone or SQL config `spark.sql.session.timeZone`. The changes allow to avoid any fluctuations of benchmarks results related to time zones, and set a reliable baseline for future optimization. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By regenerating results of DateTimeBenchmark and DateTimeRebaseBenchmark. Closes apache#28127 from MaxGekk/set-timezone-in-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho
pushed a commit
to sjincho/spark
that referenced
this pull request
Apr 15, 2020
### What changes were proposed in this pull request? In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions, and make them faster by using pre-calculated rebasing tables. This approach allows to avoid expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points when difference between Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) is changed in the time interval 0001-01-01 .. 2100-01-01: | i | local timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes| | -- | ------- |----|----| ---- | |0|0001-01-01 00:00|-62135568422|-62135740800|-2872| |1|0100-03-01 00:00|-59006333222|-59006419200|-1432| |...|...|...|...|...| |13|1582-10-15 00:00|-12219264422|-12219264000|7| |14|1883-11-18 12:00|-2717640000|-2717640000|0| The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps: 1. Look at the table, and find the time interval where the micros falls to 2. Take the difference between 2 calendars for this time interval 3. Add the difference to the input micros. The result is rebased microseconds that has the same local timestamp representation. Here are details of the implementation: - Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example: ```json [ { "tz" : "America/Los_Angeles", "switches" : [ -62135740800, -59006419200, ... , -2717640000 ], "diffs" : [ 172378, 85978, ..., 0 ] } ] ``` The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default. The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test). - The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385) - The rebase info from the JSON files is placed to hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case. The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime. - I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` to do not pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason. - I placed rebasing via local timestamp to separate methods that require zone id as the first parameter assuming that the caller has zone id already. This allows to void unnecessary retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as reference implementation. - Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If hash maps don't contain rebasing info for the given time zone id, the methods falls back to the implementation via local timestamps. This allows to support time zones specified as zone offsets like '-08:00'. ### Why are the changes needed? To make timestamps rebasing faster: - Saving timestamps to parquet files is ~ **x3.8 faster** - Loading timestamps from parquet files is ~**x2.8 faster**. - Loading timestamps by Vectorized reader ~**x4.6 faster**. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`. - swith points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function. - Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones. - Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar. - The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`. - Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR apache#28127): | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | Closes apache#28119 from MaxGekk/optimize-rebase-micros. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
In the PR, I propose to set the
America/Los_Angeles
time zone in the date-time benchmarksDateTimeBenchmark
andDateTimeRebaseBenchmark
viawithDefaultTimeZone(LA)
andwithSQLConf(SQLConf.SESSION_LOCAL_TIMEZONE.key -> LA.getId)
.The results of affected benchmarks was given on an Amazon EC2 instance w/ the configuration:
Why are the changes needed?
Performance of date-time functions can depend on the system JVM time zone or SQL config
spark.sql.session.timeZone
. The changes allow to avoid any fluctuations of benchmarks results related to time zones, and set a reliable baseline for future optimization.Does this PR introduce any user-facing change?
No
How was this patch tested?
By regenerating results of DateTimeBenchmark and DateTimeRebaseBenchmark.