
[SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function #8555

Closed
wants to merge 3 commits

Conversation

@0x0FFF (Contributor) commented Sep 1, 2015

This PR addresses SPARK-10162. The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:

  • The timezone information of the datetime is ignored
  • The datetime is assumed to be in the local timezone, which depends on the OS timezone setting

The fix includes both the code change and a regression test. Reproduction code on master:

# Python 2 pyspark shell session (`sc` is the SparkContext provided by the shell);
# on Python 3, write datetime(2000, 1, 1, ...) since leading zeros are a syntax error there.
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *

sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')

df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()

Both filters produce the same timestamp, with the time zone ignored:

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

After the fix:

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
 Scan PhysicalRDD[dt#0]
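
For reference, a minimal sketch of the conversion behaviour the fix aims for (the helper name to_internal_micros and the standalone script below are illustrative, not Spark's code): tz-aware datetimes are converted through their UTC time tuple, naive ones through local time, which reproduces the two values above.

import calendar
import time
from datetime import datetime

import pytz

def to_internal_micros(dt):
    # tz-aware: use the UTC time tuple; naive: fall back to the local timezone
    if dt.tzinfo:
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond

print(to_internal_micros(datetime(2000, 1, 1, tzinfo=pytz.timezone('UTC'))))        # 946684800000000
print(to_internal_micros(datetime(2000, 1, 1, tzinfo=pytz.timezone('Etc/GMT+3'))))  # 946695600000000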

PR #8536 was accidentally closed when I dropped the repo; this PR replaces it.

@JoshRosen (Contributor) commented:

Jenkins, this is ok to test.

@SparkQA commented Sep 1, 2015

Test build #41876 has finished for PR 8555 at commit 610cb3f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor) commented: @davies

The code under review (converting a datetime to a Timestamp, honoring tzinfo when present):

seconds = (calendar.timegm(obj.utctimetuple()) if obj.tzinfo
           else time.mktime(obj.timetuple()))
return Timestamp(int(seconds) * 1000 + obj.microsecond // 1000)
A Contributor commented:
Existing issue (not introduced by this PR): we should include the microseconds, for example:

t = Timestamp(int(seconds) * 1000)
t.setNanos(obj.microsecond * 1000)
return t
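
As a plain-Python illustration of the precision point (standalone variables, not the Py4J conversion code): dividing the microseconds by 1000 keeps at best millisecond precision, while carrying the full microsecond count as nanoseconds, setNanos-style, preserves it.

from datetime import datetime

dt = datetime(2000, 1, 1, 0, 0, 0, 123456)
print(dt.microsecond // 1000)   # 123       -> sub-millisecond part dropped
print(dt.microsecond * 1000)    # 123456000 -> full value, expressed in nanoseconds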

@0x0FFF (Contributor, author) commented:
Fixed in commit 2acd285

@SparkQA commented Sep 1, 2015

Test build #41880 has finished for PR 8555 at commit cd63eb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 1, 2015

Test build #41885 has finished for PR 8555 at commit 2acd285.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Sep 1, 2015

LGTM, merging this into master.

@asfgit closed this in bf550a4 on Sep 1, 2015