[Python] Timestamp unit change not done in from_pandas() conversion #17688

Closed
asfimport opened this issue Oct 17, 2017 · 6 comments

asfimport commented Oct 17, 2017

Calling Array.from_pandas with a pandas.Series of timestamps in 'ns' units and a target type in 'us' units causes problems. When the series has a timezone, the requested unit is ignored. When the series has no timezone, the unit is applied to the type, but printing the array raises an OverflowError.


>>> import pandas as pd
>>> import pyarrow as pa
>>> from datetime import datetime
>>> s = pd.Series([datetime.now()])
>>> s_nyc = s.dt.tz_localize('tzlocal()').dt.tz_convert('America/New_York')
>>> arr = pa.Array.from_pandas(s_nyc, type=pa.timestamp('us', tz='America/New_York'))
>>> arr.type
TimestampType(timestamp[ns, tz=America/New_York])
>>> arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
>>> arr.type
TimestampType(timestamp[us])
>>> print(arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
    values = array_format(self, window=10)
  File "pyarrow/formatting.py", line 28, in array_format
    values.append(value_format(x, 0))
  File "pyarrow/formatting.py", line 49, in value_format
    return repr(x)
  File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
    return repr(self.as_py())
  File "pyarrow/scalar.pxi", line 240, in pyarrow.lib.TimestampValue.as_py (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:21600)
    return converter(value, tzinfo=tzinfo)
  File "pyarrow/scalar.pxi", line 204, in pyarrow.lib.lambda5 (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:7295)
    TimeUnit_MICRO: lambda x, tzinfo: pd.Timestamp(
  File "pandas/_libs/tslib.pyx", line 402, in pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)
  File "pandas/_libs/tslib.pyx", line 1467, in pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:27665)
OverflowError: Python int too large to convert to C long

A workaround is to convert the values manually with astype before passing them to from_pandas:


>>> arr = pa.Array.from_pandas(s.values.astype('datetime64[us]'))
>>> arr.type
TimestampType(timestamp[us])
>>> print(arr)
<pyarrow.lib.TimestampArray object at 0x7f6a67e0a3c0>
[
  Timestamp('2017-10-17 11:04:44.308233')
]
>>> 
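
One possible reading of the traceback, not confirmed anywhere in this thread, is that the nanosecond integers are stored unchanged under the 'us' label, so scaling them back up to nanoseconds for display exceeds the int64 range. The sketch below only checks that arithmetic with the standard library; the variable names are illustrative and nothing here calls pyarrow or pandas.

```python
from datetime import datetime, timezone

INT64_MAX = 2**63 - 1
NS_PER_US = 1_000

# The timestamp printed in the workaround output, taken as UTC for illustration.
ts = datetime(2017, 10, 17, 11, 4, 44, 308233, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
delta = ts - epoch

# Integer arithmetic throughout, to avoid float rounding.
us_since_epoch = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
ns_since_epoch = us_since_epoch * NS_PER_US

# Assumed failure mode: the ns count is treated as a us count,
# then multiplied by 1000 to obtain nanoseconds for display.
mislabelled_as_ns = ns_since_epoch * NS_PER_US
overflows = mislabelled_as_ns > INT64_MAX

# What a real unit coercion (or astype('datetime64[us]')) amounts to:
# integer division of the ns count by 1000.
coerced_us = ns_since_epoch // NS_PER_US

print(overflows)
print(coerced_us == us_since_epoch)
```

If that reading is right, the astype workaround helps because the values themselves are rescaled before pyarrow sees them, so the 'us' label and the stored integers agree.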

Reporter: Bryan Cutler / @BryanCutler
Assignee: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-1680. Please see the migration documentation for further details.

Bryan Cutler / @BryanCutler:
Repro script:

import pandas as pd
import pyarrow as pa
from datetime import datetime
s = pd.Series([datetime.now()])
s_nyc = s.dt.tz_localize('tzlocal()').dt.tz_convert('America/New_York')
arr = pa.Array.from_pandas(s_nyc, type=pa.timestamp('us', tz='America/New_York'))
arr.type
arr = pa.Array.from_pandas(s, type=pa.timestamp('us'))
arr.type
print(arr)

Bryan Cutler / @BryanCutler:
@wesm should pyarrow convert the pandas series to the specified unit and timezone in from_pandas? Do you see any potential issues with the workaround using s.values.astype('datetime64[us]')? Thanks!

Wes McKinney / @wesm:
@BryanCutler we don't have the casts implemented yet. I will prioritize this for 0.8.0 since it's causing an issue for you.

Bryan Cutler / @BryanCutler:
Thanks @wesm. I'm also seeing another related issue with dates:

import pandas as pd
import pyarrow as pa
import datetime

arr = pa.array([datetime.date(2017, 10, 23)])
c = pa.Column.from_array("d", arr)

s = c.to_pandas()
print(s)
# 0   2017-10-23
# Name: d, dtype: datetime64[ns]

result = pa.Array.from_pandas(s, type=pa.date32())
print(result)
"""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 295, in pyarrow.lib.Array.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:26221)
  File "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", line 28, in array_format
    values.append(value_format(x, 0))
  File "/home/bryan/.local/lib/python2.7/site-packages/pyarrow-0.7.2.dev21+ng028f2cd-py2.7-linux-x86_64.egg/pyarrow/formatting.py", line 49, in value_format
    return repr(x)
  File "pyarrow/scalar.pxi", line 63, in pyarrow.lib.ArrayValue.__repr__ (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:19535)
  File "pyarrow/scalar.pxi", line 137, in pyarrow.lib.Date32Value.as_py (/home/bryan/git/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:20368)
ValueError: year is out of range
"""

This is a little more troublesome because I can't find a decent workaround. Should I open another JIRA for this?
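
For reference, a minimal sketch of the arithmetic a datetime64[ns] to date32 conversion would be expected to perform, assuming date32 holds whole days since 1970-01-01. The helper names ns_to_date32 and date32_to_date are hypothetical and are not pyarrow functions; this is not a workaround for the error above, only an illustration of the intended result.

```python
from datetime import date, datetime, timezone

NS_PER_DAY = 86_400 * 1_000_000_000
EPOCH_ORDINAL = date(1970, 1, 1).toordinal()


def ns_to_date32(ns_since_epoch: int) -> int:
    """Hypothetical helper: nanoseconds since the epoch -> whole days since the epoch."""
    # Floor division also rounds pre-1970 values toward the earlier day.
    return ns_since_epoch // NS_PER_DAY


def date32_to_date(days: int) -> date:
    """Hypothetical helper: days since the epoch -> datetime.date."""
    return date.fromordinal(EPOCH_ORDINAL + days)


# Midnight UTC on the date used in the example above, as a nanosecond count.
midnight = datetime(2017, 10, 23, tzinfo=timezone.utc)
ns = int(midnight.timestamp()) * 1_000_000_000

days = ns_to_date32(ns)
print(days, date32_to_date(days))
```

A day count for a present-day date is on the order of tens of thousands, so a much larger integer read as days would plausibly land far outside the supported year range.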

Wes McKinney / @wesm:
Yes, please

Wes McKinney / @wesm:
Resolved in 54d5c81
