
[SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function #8555

Closed
wants to merge 3 commits

Conversation

@0x0FFF (Contributor) commented Sep 1, 2015

This PR addresses SPARK-10162. The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:

  • The timezone information of the datetime is ignored
  • The datetime is assumed to be in the local timezone, which depends on the OS timezone setting

The fix includes both the code change and a regression test. Reproduction code on master:

# Python 2 pyspark shell session (`sc` is the SparkContext provided by the shell);
# on Python 3, write datetime(2000, 1, 1, ...) since leading zeros are a syntax error there.
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *

sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')

df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()

Both filters produce the same timestamp, with the time zone ignored:

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

After the fix:

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
 Scan PhysicalRDD[dt#0]
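
For reference, a minimal sketch of the conversion behaviour the fix aims for (the helper name to_internal_micros and the standalone script below are illustrative, not Spark's code): tz-aware datetimes are converted through their UTC time tuple, naive ones through local time, which reproduces the two values above.

import calendar
import time
from datetime import datetime

import pytz

def to_internal_micros(dt):
    # tz-aware: use the UTC time tuple; naive: fall back to the local timezone
    if dt.tzinfo:
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond

print(to_internal_micros(datetime(2000, 1, 1, tzinfo=pytz.timezone('UTC'))))        # 946684800000000
print(to_internal_micros(datetime(2000, 1, 1, tzinfo=pytz.timezone('Etc/GMT+3'))))  # 946695600000000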

PR #8536 was accidentally closed when I dropped the repo; this PR replaces it.

@JoshRosen (Contributor) commented:

Jenkins, this is ok to test.

@SparkQA commented Sep 1, 2015

Test build #41876 has finished for PR 8555 at commit 610cb3f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor) commented: @davies

The code under review (converting a datetime to a Timestamp, honoring tzinfo when present):

seconds = (calendar.timegm(obj.utctimetuple()) if obj.tzinfo
           else time.mktime(obj.timetuple()))
return Timestamp(int(seconds) * 1000 + obj.microsecond // 1000)
A Contributor commented:
Existing issue (not introduced by this PR): we should include the microseconds, for example:

t = Timestamp(int(seconds) * 1000)
t.setNanos(obj.microsecond * 1000)
return t
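
As a plain-Python illustration of the precision point (standalone variables, not the Py4J conversion code): dividing the microseconds by 1000 keeps at best millisecond precision, while carrying the full microsecond count as nanoseconds, setNanos-style, preserves it.

from datetime import datetime

dt = datetime(2000, 1, 1, 0, 0, 0, 123456)
print(dt.microsecond // 1000)   # 123       -> sub-millisecond part dropped
print(dt.microsecond * 1000)    # 123456000 -> full value, expressed in nanoseconds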

@0x0FFF (Contributor, author) commented:
Fixed in commit 2acd285

@SparkQA commented Sep 1, 2015

Test build #41880 has finished for PR 8555 at commit cd63eb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 1, 2015

Test build #41885 has finished for PR 8555 at commit 2acd285.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Sep 1, 2015

LGTM, merging this into master.

@asfgit closed this in bf550a4 on Sep 1, 2015