
[SPARK-39176][PYSPARK] Fixed a problem with pyspark serializing pre-1970 datetime in windows #36566

Closed
wants to merge 13 commits

Conversation

@dingsl-giser (Author)

What changes were proposed in this pull request?

Fix problems with pyspark in Windows:

  1. Fixed datetime-to-timestamp conversion for dates before 1970;
  2. Fixed timestamp-to-datetime conversion for negative timestamps;
  3. Added a test script.

Why are the changes needed?

PySpark fails to serialize pre-1970 datetimes on Windows.

Running the following code on Windows raises an exception:

from datetime import datetime

rdd = sc.parallelize([('a', datetime(1957, 1, 9, 0, 0)),
                      ('b', datetime(2014, 1, 27, 0, 0))])
df = spark.createDataFrame(rdd, ["id", "date"])

df.show()
df.printSchema()

print(df.collect())
  File "...\spark\python\lib\pyspark.zip\pyspark\sql\types.py", line 195, in toInternal
    else time.mktime(dt.timetuple()))
OverflowError: mktime argument out of range

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:638)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	... 1 more

and

File ...\spark\python\lib\pyspark.zip\pyspark\sql\types.py, in fromInternal:
Line 207:   return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

OSError: [Errno 22] Invalid argument
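Both errors come from `time.mktime` and `datetime.fromtimestamp` delegating to the platform's C runtime, which on Windows rejects timestamps before the epoch. A minimal sketch of the idea behind a platform-independent conversion, using pure `timedelta` arithmetic instead of the C calls (helper names are hypothetical; this is an illustration, not Spark's actual patch):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_internal(dt):
    # Microseconds since the Unix epoch, computed with timedelta
    # arithmetic so the C runtime (whose mktime rejects pre-1970
    # values on Windows) is never involved.
    delta = dt - EPOCH
    return (delta.days * 86_400_000_000
            + delta.seconds * 1_000_000
            + delta.microseconds)

def from_internal(ts):
    # Inverse conversion; valid for negative ts as well.
    return EPOCH + timedelta(microseconds=ts)
```

Note that this sketch treats naive datetimes as UTC, whereas Spark's real `toInternal`/`fromInternal` work in local time (which is why the `df.show()` timestamps below appear shifted by the reporter's timezone offset), so an actual fix must preserve that behaviour.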

After the fix, the same code runs successfully:

+---+-------------------+
| id|               date|
+---+-------------------+
|  a|1957-01-08 16:00:00|
|  b|2014-01-26 16:00:00|
+---+-------------------+

root
 |-- id: string (nullable = true)
 |-- date: timestamp (nullable = true)

[Row(id='a', date=datetime.datetime(1957, 1, 9, 0, 0)), Row(id='b', date=datetime.datetime(2014, 1, 27, 0, 0))] 

Does this PR introduce any user-facing change?

No

How was this patch tested?

New and existing test suites

@dingsl-giser (Author)

@HyukjinKwon I closed the PR against the 3.0 branch (#36537) and opened a new PR against the master branch.

seconds = (
    calendar.timegm(dt.utctimetuple()) if dt.tzinfo else time.mktime(dt.timetuple())
)
if platform.system().lower() == 'windows':
Member:

Should we also check if the value is negative?

Author:

No need; this method still works when the value is negative.
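For reference, Python normalizes a negative `timedelta` so that only `days` is negative while `seconds` and `microseconds` stay in range, so component-wise arithmetic remains exact for pre-1970 dates (a small illustration, separate from the patch itself):

```python
from datetime import datetime, timedelta

# One second before the epoch:
delta = datetime(1969, 12, 31, 23, 59, 59) - datetime(1970, 1, 1)

# Normalized form: only `days` is negative.
print(delta.days, delta.seconds, delta.microseconds)  # -1 86399 0

# The component sum still equals the true offset of -1 second:
total_us = delta.days * 86_400_000_000 + delta.seconds * 1_000_000 + delta.microseconds
print(total_us)  # -1000000
```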

Member:

How about performance? This is a hot path executed for every value, and the performance here is pretty critical.

Author:

Is there an established performance test we can run?
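One quick way to gauge the hot-path cost is a `timeit` micro-benchmark comparing the existing `time.mktime` call with epoch-delta arithmetic (a rough sketch only, not an authoritative Spark benchmark; absolute numbers vary by platform and Python build):

```python
import time
import timeit
from datetime import datetime, timedelta

dt = datetime(2014, 1, 27)
EPOCH = datetime(1970, 1, 1)
N = 100_000

# Existing approach: delegates to the C runtime's mktime.
t_mktime = timeit.timeit(lambda: time.mktime(dt.timetuple()), number=N)

# Candidate approach: pure-Python timedelta floor division to microseconds.
t_delta = timeit.timeit(lambda: (dt - EPOCH) // timedelta(microseconds=1), number=N)

print(f"time.mktime: {t_mktime:.3f}s for {N} calls")
print(f"epoch delta: {t_delta:.3f}s for {N} calls")
```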

@HyukjinKwon (Member) commented May 17, 2022

@AnywalkerGiser BTW, it would be easier to follow and review if you could add a link describing the OS-specific behaviour of that Python library.

@dingsl-giser (Author)

@HyukjinKwon Do you mean the platform-specific behaviour, documented in the code comments?

@HyukjinKwon (Member)

I am surprised that the same Python library returns a different value depending on the OS. I know some libraries depend on the C library implementation, but I didn't expect such a drastic difference. A code comment or documentation, anything is fine. I would like to see whether this is an official difference in Python.

@dingsl-giser (Author)

@HyukjinKwon This is explained in the Python 3 datetime interface documentation.

@HyukjinKwon (Member)

So does the C localtime function behave differently across OSes when the timestamp is negative?

@dingsl-giser (Author)

The localtime function does not differ across OSes, but fromtimestamp does, and many datetime functions have problems handling dates before 1970 on Windows.
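To illustrate the difference: `datetime.fromtimestamp` goes through the platform's C `localtime`/`gmtime`, and Microsoft's C runtime rejects negative timestamps, while `timedelta` arithmetic is computed entirely inside `datetime` and is portable (a small demonstration; the timestamp value is just 1957-01-09 expressed in seconds):

```python
from datetime import datetime, timedelta, timezone

ts = -409_536_000  # seconds since the epoch for 1957-01-09 00:00:00 UTC

# Portable: never calls into the C runtime, so it also works on Windows.
portable = datetime(1970, 1, 1) + timedelta(seconds=ts)

# Platform-dependent: on Windows this can raise OSError for negative values.
try:
    native = datetime.fromtimestamp(ts, tz=timezone.utc).replace(tzinfo=None)
except OSError:
    native = None  # the Windows failure mode this PR works around

print(portable)  # 1957-01-09 00:00:00
```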

@dingsl-giser (Author)

@HyukjinKwon Can this solution be merged into master?

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member) commented May 17, 2022

I would like to understand the problem first, and see whether this behaviour difference is intentional or official. This is a core code path, so expect more reviews and time.

@dingsl-giser (Author)

Understood, looking forward to your reply later.

@HyukjinKwon (Member)

@AnywalkerGiser mind pointing out any documentation that describes this negative-value behaviour?

@dingsl-giser (Author)

Here are some blogs on related issues:
Python | mktime overflow error
Python fromtimestamp OSError

@dingsl-giser (Author)

I think the Spark project can look for a solution if Python doesn't fix this bug.

@HyukjinKwon (Member)

Does that happen in all Windows with all Python versions?

@HyukjinKwon (Member)

Is this a bug in Python?

@dingsl-giser (Author)

I thought it was a bug in Python, but the documentation says it raises an error when the value is out of the supported range. I tested Python 3.6, 3.7, and 3.8 on Windows.

seconds = (
    calendar.timegm(dt.utctimetuple()) if dt.tzinfo else time.mktime(dt.timetuple())
)
if platform.system().lower() == "windows":
Member:

I'm just super surprised this is Windows specific?

Author:

Yes, and I would encourage you to verify it yourself.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Oct 10, 2022
@github-actions github-actions bot closed this Oct 11, 2022
@dingsl-giser (Author)

Has this problem been solved yet?

@dingsl-giser (Author)

@HyukjinKwon This problem still exists in the new version. Can it be merged?
