Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25605][TESTS] Run cast string to timestamp tests for a subset of timezones #22631

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -20,6 +20,8 @@ package org.apache.spark.sql.catalyst.expressions
import java.sql.{Date, Timestamp}
import java.util.{Calendar, Locale, TimeZone}

import scala.util.Random

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
Expand Down Expand Up @@ -110,7 +112,7 @@ class CastSuite extends SparkFunSuite with ExpressionEvalHelper {
}

test("cast string to timestamp") {
for (tz <- ALL_TIMEZONES) {
for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 @gatorsmile surely we shouldn't do this. We're introducing nondeterminism into the test, and also no longer testing things we were testing before. If there's a lot of timezones we don't need to test, then, don't test them. Right? I didn't see this PR but would have objected to this change if I had seen it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen I think the point here is to test that this works with different timezones. In DateExpressionsSuite, for instance, we test only 3-4 timezones. I don't think it makes sense to test every of the 650 possible timezones: if it works with many of them, then it means that the code is generic and respects timezones. We can also define a fixed subset of timezones, but IMHO taking randomly some of them provides the additional safety that if there is a specific single timezone creating problem, we are able to identify it on several subsequent runs.

We have many place where we generate data randomly in the test, so we already have randomness in the tests. I think the rationale behind them is the same: if the function works with some different data, then it generalize properly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests should be deterministic, ideally; any sources of randomness should be seeded. Do you see one that isn't?

I think this is like deciding we'll run just 90% of all test suites every time randomly, to save time. I think it's just well against good practice.

There are other solutions:

  1. pick a subset of timezones that we're confident do exercise the code and just explicitly test those
  2. parallelize these tests within the test suite

The latter should be trivial in this case: ALL_TIMEZONES.par.foreach { tz => instead. It's the same amount of work but 8x, 16x faster by wall clock time, depending on how many cores are available. What about that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are many tests where data is randomly generated. And they are not seeded of course.

As I said, I think the goal here is to test that the function works well with different timezones: then picking a subset of timezones would be fine too, but I prefer taking them randomly among all because if there is a single timezone creating issues (very unlikely IMHO), we would discover it anyway (not on the single run though).

Moreover, it would be great then to be consistent among all the codebase on what we test. In DateExpressionsSuite we test only 3 timezones and here we test all 650: it is a weird, isn't it? We should probably define which is the right thing to do when timezones are involved and test always the same. Otherwise, testing 650 timezones on a single specific function and 3 on the most of the others is a nonsense IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely not by design? tests need to be deterministic, or else what's the value? failures can't be reproduced. (I know that in practice many things are hard to make deterministic.)

Of course, if you're worried that we might not be testing an important case, we have to test it. We can't just not test it sometimes to make some tests run a little faster.

Testing just 3 timezones might be fine too; I don't know. Testing 50 randomly seems suboptimal in all cases.

I'll open a PR to try simply testing in parallel instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think tests need to be deterministic in general as well.

In this particular case ideally, we should categorize timezones, pick up some timezones representing them and test fixed set. For instance, timezone with DST, without DST, and some exceptions such as, for instance, see this particular case which Python 3.6 addressed lately (https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Lib/datetime.py#L1572-L1574), IMHO.

Of course, this approach requires a lot of investigations and overheads. So, as an alternative, I would incline to go for Sean's approach (https://github.com/apache/spark/pull/22631/files#r223224573) for this particular case.

For randomness, I think primarily we should have first deterministic set of tests. Maybe we could additionally have some set of randomized input to cover some cases we haven't foreseen but that's secondary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean some tests like with randomized input, let's say, integer range input are fine in common sense but this case is different, isn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that adding parallelism is a good way for improve test time. The amount of resources used for testing is anyway limited. I think the goal here is not (only) reduce the overall time of the test but also reduce the amount of resources needed to test.

Problems with a specific timezone like you mentioned, @HyukjinKwon, are exactly the reason why I am proposing this randomized approach, rather than picking 3 timezones and always use them as done in DateExpressionsSuite: if there is a problem with a specific timezone, in this way, we will be able to catch it. With a fixed subset of them (even though not on the single run), we are not.

The only safe deterministic way would be to run against all of them, as it was done before, but then I'd argue that we should do the same everywhere we have different timezones involved in tests (why are we testing all timezones only for casting to timestamp and not for all other functions involving dates and times, if it is so important to check all timezones?). But then the amount of time needed to run all the tests would be crazy, so it is not doable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should categorize timezones, pick up some timezones representing them and test fixed set

That would be the best, but we need some deep understanding of timezone, to make sure the test coverage is good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't blocked on CPU time or resources, no. The tests are mostly single-threaded, and the big test machines mostly idle throughout the runs. I've opened #22672 to evaluate that.

I feel strongly that we can't do this kind of thing and am opening a thread on dev@ for wider discussion.

def checkCastStringToTimestamp(str: String, expected: Timestamp): Unit = {
checkEvaluation(cast(Literal(str), TimestampType, Option(tz.getID)), expected)
}
Expand Down