-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failures with upcoming pandas 1.4 release #8580
Comments
Thanks for raising this @jorisvandenbossche! It is indeed good to see these all in one place. If you have time to take a look at any of the PRs it would definitely be appreciated. |
I updated the list with the PRs that I think address each issue. |
Alright pandas 1.4 has been released and as expected there are some failures in the python 3.9 environments and in doctests where we pick up latest pandas. |
I'm opening a PR now to skip append tests. |
@martindurant do you have time to take a look at the timezone related issues with fastparquet? |
Here is the output from the failing fastparquet tests: Details=============================================== test session starts ===============================================
platform linux -- Python 3.9.9, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/julia/conda/envs/dask-3.9/bin/python
cachedir: .pytest_cache
rootdir: /home/julia/dask, configfile: setup.cfg
plugins: rerunfailures-10.2, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 876 items / 873 deselected / 3 selected
run-last-failure: rerun previous 3 failures
dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12] FAILED [ 33%]
dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13] FAILED [ 66%]
dask/dataframe/io/tests/test_parquet.py::test_timestamp96 FAILED [100%]
==================================================== FAILURES =====================================================
__________________________ test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12] __________________________
data = array(['now'], dtype=object), dayfirst = False, yearfirst = False, utc = False, errors = 'raise'
require_iso8601 = False, allow_object = True, allow_mixed = False
def objects_to_datetime64ns(
data: np.ndarray,
dayfirst,
yearfirst,
utc=False,
errors="raise",
require_iso8601: bool = False,
allow_object: bool = False,
allow_mixed: bool = False,
):
"""
Convert data to array of timestamps.
Parameters
----------
data : np.ndarray[object]
dayfirst : bool
yearfirst : bool
utc : bool, default False
Whether to convert timezone-aware timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
allow_mixed : bool, default False
Interpret integers as timestamps when datetime objects are also present.
Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
np.datetime64[ns] if returned values represent wall times
object if mixed timezones
inferred_tz : tzinfo or None
Raises
------
ValueError : if data cannot be converted to datetimes
"""
assert errors in ["raise", "ignore", "coerce"]
# if str-dtype, convert
data = np.array(data, copy=False, dtype=np.object_)
flags = data.flags
order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
try:
result, tz_parsed = tslib.array_to_datetime(
data.ravel("K"),
errors=errors,
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
allow_mixed=allow_mixed,
)
result = result.reshape(data.shape, order=order)
except ValueError as err:
try:
> values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2211:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E TypeError: Unrecognized value type: <class 'str'>
pandas/_libs/tslibs/conversion.pyx:360: TypeError
During handling of the above exception, another exception occurred:
column = Series([], Name: x, dtype: datetime64[ns, UTC]), name = 'x'
def get_column_metadata(column, name):
"""Produce pandas column metadata block"""
# from pyarrow.pandas_compat
# https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py
inferred_dtype = infer_dtype(column)
dtype = column.dtype
if str(dtype) == "bool":
# pandas accidentally calls this "boolean"
inferred_dtype = "bool"
if is_categorical_dtype(dtype):
extra_metadata = {
'num_categories': len(column.cat.categories),
'ordered': column.cat.ordered,
}
dtype = column.cat.codes.dtype
elif hasattr(dtype, 'tz'):
try:
stz = str(dtype.tz)
if "UTC" in stz and ":" in stz:
extra_metadata = {'timezone': stz.strip("UTC")}
elif "pytz" not in stz:
> pd.Series([pd.to_datetime('now')]).dt.tz_localize(stz)
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/util.py:314:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arg = 'now', errors = 'raise', dayfirst = False, yearfirst = False, utc = None, format = None, exact = True
unit = None, infer_datetime_format = False, origin = 'unix', cache = True
def to_datetime(
arg: DatetimeScalarOrArrayConvertible,
errors: str = "raise",
dayfirst: bool = False,
yearfirst: bool = False,
utc: bool | None = None,
format: str | None = None,
exact: bool = True,
unit: str | None = None,
infer_datetime_format: bool = False,
origin="unix",
cache: bool = True,
) -> DatetimeIndex | Series | DatetimeScalar | NaTType | None:
"""
Convert argument to datetime.
This function converts a scalar, array-like, :class:`Series` or
:class:`DataFrame`/dict-like to a pandas datetime object.
Parameters
----------
arg : int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
The object to convert to a datetime. If a :class:`DataFrame` is provided, the
method expects minimally the following columns: :const:`"year"`,
:const:`"month"`, :const:`"day"`.
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
- If :const:`'raise'`, then invalid parsing will raise an exception.
- If :const:`'coerce'`, then invalid parsing will be set as :const:`NaT`.
- If :const:`'ignore'`, then invalid parsing will return the input.
dayfirst : bool, default False
Specify a date parse order if `arg` is str or is list-like.
If :const:`True`, parses dates with the day first, e.g. :const:`"10/11/12"`
is parsed as :const:`2012-11-10`.
.. warning::
``dayfirst=True`` is not strict, but will prefer to parse
with day first. If a delimited date string cannot be parsed in
accordance with the given `dayfirst` option, e.g.
``to_datetime(['31-12-2021'])``, then a warning will be shown.
yearfirst : bool, default False
Specify a date parse order if `arg` is str or is list-like.
- If :const:`True` parses dates with the year first, e.g.
:const:`"10/11/12"` is parsed as :const:`2010-11-12`.
- If both `dayfirst` and `yearfirst` are :const:`True`, `yearfirst` is
preceded (same as :mod:`dateutil`).
.. warning::
``yearfirst=True`` is not strict, but will prefer to parse
with year first.
utc : bool, default None
Control timezone-related parsing, localization and conversion.
- If :const:`True`, the function *always* returns a timezone-aware
UTC-localized :class:`Timestamp`, :class:`Series` or
:class:`DatetimeIndex`. To do this, timezone-naive inputs are
*localized* as UTC, while timezone-aware inputs are *converted* to UTC.
- If :const:`False` (default), inputs will not be coerced to UTC.
Timezone-naive inputs will remain naive, while timezone-aware ones
will keep their time offsets. Limitations exist for mixed
offsets (typically, daylight savings), see :ref:`Examples
<to_datetime_tz_examples>` section for details.
See also: pandas general documentation about `timezone conversion and
localization
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#time-zone-handling>`_.
format : str, default None
The strftime to parse time, e.g. :const:`"%d/%m/%Y"`. Note that
:const:`"%f"` will parse all the way up to nanoseconds. See
`strftime documentation
<https://docs.python.org/3/library/datetime.html
#strftime-and-strptime-behavior>`_ for more information on choices.
exact : bool, default True
Control how `format` is used:
- If :const:`True`, require an exact `format` match.
- If :const:`False`, allow the `format` to match anywhere in the target
string.
unit : str, default 'ns'
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an
integer or float number. This will be based off the origin.
Example, with ``unit='ms'`` and ``origin='unix'`` (the default), this
would calculate the number of milliseconds to the unix epoch start.
infer_datetime_format : bool, default False
If :const:`True` and no `format` is given, attempt to infer the format
of the datetime strings based on the first non-NaN element,
and if it can be inferred, switch to a faster method of parsing them.
In some cases this can increase the parsing speed by ~5-10x.
origin : scalar, default 'unix'
Define the reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
- If :const:`'unix'` (or POSIX) time; origin is set to 1970-01-01.
- If :const:`'julian'`, unit must be :const:`'D'`, and origin is set to
beginning of Julian Calendar. Julian day number :const:`0` is assigned
to the day starting at noon on January 1, 4713 BC.
- If Timestamp convertible, origin is set to Timestamp identified by
origin.
cache : bool, default True
If :const:`True`, use a cache of unique, converted dates to apply the
datetime conversion. May produce significant speed-up when parsing
duplicate date strings, especially ones with timezone offsets. The cache
is only used when there are at least 50 values. The presence of
out-of-bounds values will render the cache unusable and may slow down
parsing.
.. versionchanged:: 0.25.0
changed default value from :const:`False` to :const:`True`.
Returns
-------
datetime
If parsing succeeded.
Return type depends on input (types in parenthesis correspond to
fallback in case of unsuccessful timezone or out-of-range timestamp
parsing):
- scalar: :class:`Timestamp` (or :class:`datetime.datetime`)
- array-like: :class:`DatetimeIndex` (or :class:`Series` with
:class:`object` dtype containing :class:`datetime.datetime`)
- Series: :class:`Series` of :class:`datetime64` dtype (or
:class:`Series` of :class:`object` dtype containing
:class:`datetime.datetime`)
- DataFrame: :class:`Series` of :class:`datetime64` dtype (or
:class:`Series` of :class:`object` dtype containing
:class:`datetime.datetime`)
Raises
------
ParserError
When parsing a date from string fails.
ValueError
When another datetime conversion error happens. For example when one
of 'year', 'month', day' columns is missing in a :class:`DataFrame`, or
when a Timezone-aware :class:`datetime.datetime` is found in an array-like
of mixed time offsets, and ``utc=False``.
See Also
--------
DataFrame.astype : Cast argument to a specified dtype.
to_timedelta : Convert argument to timedelta.
convert_dtypes : Convert dtypes.
Notes
-----
Many input types are supported, and lead to different output types:
- **scalars** can be int, float, str, datetime object (from stdlib :mod:`datetime`
module or :mod:`numpy`). They are converted to :class:`Timestamp` when
possible, otherwise they are converted to :class:`datetime.datetime`.
None/NaN/null scalars are converted to :const:`NaT`.
- **array-like** can contain int, float, str, datetime objects. They are
converted to :class:`DatetimeIndex` when possible, otherwise they are
converted to :class:`Index` with :class:`object` dtype, containing
:class:`datetime.datetime`. None/NaN/null entries are converted to
:const:`NaT` in both cases.
- **Series** are converted to :class:`Series` with :class:`datetime64`
dtype when possible, otherwise they are converted to :class:`Series` with
:class:`object` dtype, containing :class:`datetime.datetime`. None/NaN/null
entries are converted to :const:`NaT` in both cases.
- **DataFrame/dict-like** are converted to :class:`Series` with
:class:`datetime64` dtype. For each row a datetime is created from assembling
the various dataframe columns. Column keys can be common abbreviations
like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or
plurals of the same.
The following causes are responsible for :class:`datetime.datetime` objects
being returned (possibly inside an :class:`Index` or a :class:`Series` with
:class:`object` dtype) instead of a proper pandas designated type
(:class:`Timestamp`, :class:`DatetimeIndex` or :class:`Series`
with :class:`datetime64` dtype):
- when any input element is before :const:`Timestamp.min` or after
:const:`Timestamp.max`, see `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_.
- when ``utc=False`` (default) and the input is an array-like or
:class:`Series` containing mixed naive/aware datetime, or aware with mixed
time offsets. Note that this happens in the (quite frequent) situation when
the timezone has a daylight savings policy. In that case you may wish to
use ``utc=True``.
Examples
--------
**Handling various input formats**
Assembling a datetime from multiple columns of a :class:`DataFrame`. The keys
can be common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Passing ``infer_datetime_format=True`` can often-times speedup a parsing
if its not an ISO8601 format exactly, but in a regular format.
>>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)
>>> s.head()
0 3/11/2000
1 3/12/2000
2 3/13/2000
3 3/11/2000
4 3/12/2000
dtype: object
>>> %timeit pd.to_datetime(s, infer_datetime_format=True) # doctest: +SKIP
100 loops, best of 3: 10.4 ms per loop
>>> %timeit pd.to_datetime(s, infer_datetime_format=False) # doctest: +SKIP
1 loop, best of 3: 471 ms per loop
Using a unix epoch time
>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
.. warning:: For float arg, precision rounding might happen. To prevent
unexpected behavior use a fixed-width exact type.
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D',
... origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'],
dtype='datetime64[ns]', freq=None)
**Non-convertible date/times**
If a date does not meet the `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_, passing ``errors='ignore'``
will return the original input instead of raising any exception.
Passing ``errors='coerce'`` will force an out-of-bounds date to :const:`NaT`,
in addition to forcing non-dates (or non-parseable dates) to :const:`NaT`.
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT
.. _to_datetime_tz_examples:
**Timezones and time offsets**
The default behaviour (``utc=False``) is as follows:
- Timezone-naive inputs are converted to timezone-naive :class:`DatetimeIndex`:
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00:15'])
DatetimeIndex(['2018-10-26 12:00:00', '2018-10-26 13:00:15'],
dtype='datetime64[ns]', freq=None)
- Timezone-aware inputs *with constant time offset* are converted to
timezone-aware :class:`DatetimeIndex`:
>>> pd.to_datetime(['2018-10-26 12:00 -0500', '2018-10-26 13:00 -0500'])
DatetimeIndex(['2018-10-26 12:00:00-05:00', '2018-10-26 13:00:00-05:00'],
dtype='datetime64[ns, pytz.FixedOffset(-300)]', freq=None)
- However, timezone-aware inputs *with mixed time offsets* (for example
issued from a timezone with daylight savings, such as Europe/Paris)
are **not successfully converted** to a :class:`DatetimeIndex`. Instead a
simple :class:`Index` containing :class:`datetime.datetime` objects is
returned:
>>> pd.to_datetime(['2020-10-25 02:00 +0200', '2020-10-25 04:00 +0100'])
Index([2020-10-25 02:00:00+02:00, 2020-10-25 04:00:00+01:00],
dtype='object')
- A mix of timezone-aware and timezone-naive inputs is converted to
a timezone-aware :class:`DatetimeIndex` if the offsets of the timezone-aware
are constant:
>>> from datetime import datetime
>>> pd.to_datetime(["2020-01-01 01:00 -01:00", datetime(2020, 1, 1, 3, 0)])
DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'],
dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
- Finally, mixing timezone-aware strings and :class:`datetime.datetime` always
raises an error, even if the elements all have the same time offset.
>>> from datetime import datetime, timezone, timedelta
>>> d = datetime(2020, 1, 1, 18, tzinfo=timezone(-timedelta(hours=1)))
>>> pd.to_datetime(["2020-01-01 17:00 -0100", d])
Traceback (most recent call last):
...
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64
unless utc=True
|
Setting ``utc=True`` solves most of the above issues:
- Timezone-naive inputs are *localized* as UTC
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00'], utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 13:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
- Timezone-aware inputs are *converted* to UTC (the output represents the
exact same datetime, but viewed from the UTC time offset `+00:00`).
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'],
... utc=True)
DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
- Inputs can contain both naive and aware, string or datetime, the above
rules still apply
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 12:00 -0530',
... datetime(2020, 1, 1, 18),
... datetime(2020, 1, 1, 18,
... tzinfo=timezone(-timedelta(hours=1)))],
... utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 17:30:00+00:00',
'2020-01-01 18:00:00+00:00', '2020-01-01 19:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
"""
if arg is None:
return None
if origin != "unix":
arg = _adjust_to_origin(arg, origin, unit)
tz = "utc" if utc else None
convert_listlike = partial(
_convert_listlike_datetimes,
tz=tz,
unit=unit,
dayfirst=dayfirst,
yearfirst=yearfirst,
errors=errors,
exact=exact,
infer_datetime_format=infer_datetime_format,
)
result: Timestamp | NaTType | Series | Index
if isinstance(arg, Timestamp):
result = arg
if tz is not None:
if arg.tz is not None:
result = arg.tz_convert(tz)
else:
result = arg.tz_localize(tz)
elif isinstance(arg, ABCSeries):
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
if not cache_array.empty:
result = arg.map(cache_array)
else:
values = convert_listlike(arg._values, format)
result = arg._constructor(values, index=arg.index, name=arg.name)
elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):
result = _assemble_from_unit_mappings(arg, errors, tz)
elif isinstance(arg, Index):
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
if not cache_array.empty:
result = _convert_and_box_cache(arg, cache_array, name=arg.name)
else:
result = convert_listlike(arg, format, name=arg.name)
elif is_list_like(arg):
try:
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
except OutOfBoundsDatetime:
# caching attempts to create a DatetimeIndex, which may raise
# an OOB. If that's the desired behavior, then just reraise...
if errors == "raise":
raise
# ... otherwise, continue without the cache.
from pandas import Series
cache_array = Series([], dtype=object) # just an empty array
if not cache_array.empty:
result = _convert_and_box_cache(arg, cache_array)
else:
result = convert_listlike(arg, format)
else:
> result = convert_listlike(np.array([arg]), format)[0]
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:1078:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arg = array(['now'], dtype=object), format = None, name = None, tz = None, unit = None, errors = 'raise'
infer_datetime_format = False, dayfirst = False, yearfirst = False, exact = True
def _convert_listlike_datetimes(
arg,
format: str | None,
name: Hashable = None,
tz: Timezone | None = None,
unit: str | None = None,
errors: str = "raise",
infer_datetime_format: bool = False,
dayfirst: bool | None = None,
yearfirst: bool | None = None,
exact: bool = True,
):
"""
Helper function for to_datetime. Performs the conversions of 1D listlike
of dates
Parameters
----------
arg : list, tuple, ndarray, Series, Index
date to be parsed
name : object
None or string for the Index name
tz : object
None or 'utc'
unit : str
None or string of the frequency of the passed data
errors : str
error handing behaviors from to_datetime, 'raise', 'coerce', 'ignore'
infer_datetime_format : bool, default False
inferring format behavior from to_datetime
dayfirst : bool
dayfirst parsing behavior from to_datetime
yearfirst : bool
yearfirst parsing behavior from to_datetime
exact : bool, default True
exact format matching behavior from to_datetime
Returns
-------
Index-like of parsed dates
"""
if isinstance(arg, (list, tuple)):
arg = np.array(arg, dtype="O")
arg_dtype = getattr(arg, "dtype", None)
# these are shortcutable
if is_datetime64tz_dtype(arg_dtype):
if not isinstance(arg, (DatetimeArray, DatetimeIndex)):
return DatetimeIndex(arg, tz=tz, name=name)
if tz == "utc":
arg = arg.tz_convert(None).tz_localize(tz)
return arg
elif is_datetime64_ns_dtype(arg_dtype):
if not isinstance(arg, (DatetimeArray, DatetimeIndex)):
try:
return DatetimeIndex(arg, tz=tz, name=name)
except ValueError:
pass
elif tz:
# DatetimeArray, DatetimeIndex
return arg.tz_localize(tz)
return arg
elif unit is not None:
if format is not None:
raise ValueError("cannot specify both format and unit")
return _to_datetime_with_unit(arg, unit, name, tz, errors)
elif getattr(arg, "ndim", 1) > 1:
raise TypeError(
"arg must be a string, datetime, list, tuple, 1-d array, or Series"
)
# warn if passing timedelta64, raise for PeriodDtype
# NB: this must come after unit transformation
orig_arg = arg
try:
arg, _ = maybe_convert_dtype(arg, copy=False)
except TypeError:
if errors == "coerce":
npvalues = np.array(["NaT"], dtype="datetime64[ns]").repeat(len(arg))
return DatetimeIndex(npvalues, name=name)
elif errors == "ignore":
idx = Index(arg, name=name)
return idx
raise
arg = ensure_object(arg)
require_iso8601 = False
if infer_datetime_format and format is None:
format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
if format is not None:
# There is a special fast-path for iso8601 formatted
# datetime strings, so in those cases don't use the inferred
# format because this path makes process slower in this
# special case
format_is_iso8601 = format_is_iso(format)
if format_is_iso8601:
require_iso8601 = not infer_datetime_format
format = None
if format is not None:
res = _to_datetime_with_format(
arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
)
if res is not None:
return res
assert format is None or infer_datetime_format
utc = tz == "utc"
> result, tz_parsed = objects_to_datetime64ns(
arg,
dayfirst=dayfirst,
yearfirst=yearfirst,
utc=utc,
errors=errors,
require_iso8601=require_iso8601,
allow_object=True,
)
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:402:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = array(['now'], dtype=object), dayfirst = False, yearfirst = False, utc = False, errors = 'raise'
require_iso8601 = False, allow_object = True, allow_mixed = False
def objects_to_datetime64ns(
data: np.ndarray,
dayfirst,
yearfirst,
utc=False,
errors="raise",
require_iso8601: bool = False,
allow_object: bool = False,
allow_mixed: bool = False,
):
"""
Convert data to array of timestamps.
Parameters
----------
data : np.ndarray[object]
dayfirst : bool
yearfirst : bool
utc : bool, default False
Whether to convert timezone-aware timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
allow_mixed : bool, default False
Interpret integers as timestamps when datetime objects are also present.
Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
np.datetime64[ns] if returned values represent wall times
object if mixed timezones
inferred_tz : tzinfo or None
Raises
------
ValueError : if data cannot be converted to datetimes
"""
assert errors in ["raise", "ignore", "coerce"]
# if str-dtype, convert
data = np.array(data, copy=False, dtype=np.object_)
flags = data.flags
order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
try:
result, tz_parsed = tslib.array_to_datetime(
data.ravel("K"),
errors=errors,
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
allow_mixed=allow_mixed,
)
result = result.reshape(data.shape, order=order)
except ValueError as err:
try:
values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
# If tzaware, these values represent unix timestamps, so we
# return them as i8 to distinguish from wall times
values = values.reshape(data.shape, order=order)
return values.view("i8"), tz_parsed
except (ValueError, TypeError):
> raise err
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2217:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = array(['now'], dtype=object), dayfirst = False, yearfirst = False, utc = False, errors = 'raise'
require_iso8601 = False, allow_object = True, allow_mixed = False
def objects_to_datetime64ns(
data: np.ndarray,
dayfirst,
yearfirst,
utc=False,
errors="raise",
require_iso8601: bool = False,
allow_object: bool = False,
allow_mixed: bool = False,
):
"""
Convert data to array of timestamps.
Parameters
----------
data : np.ndarray[object]
dayfirst : bool
yearfirst : bool
utc : bool, default False
Whether to convert timezone-aware timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
allow_mixed : bool, default False
Interpret integers as timestamps when datetime objects are also present.
Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
np.datetime64[ns] if returned values represent wall times
object if mixed timezones
inferred_tz : tzinfo or None
Raises
------
ValueError : if data cannot be converted to datetimes
"""
assert errors in ["raise", "ignore", "coerce"]
# if str-dtype, convert
data = np.array(data, copy=False, dtype=np.object_)
flags = data.flags
order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
try:
> result, tz_parsed = tslib.array_to_datetime(
data.ravel("K"),
errors=errors,
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
allow_mixed=allow_mixed,
)
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2199:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:381:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:613:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:751:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:742:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslibs/parsing.pyx:281:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
timestr = 'now', parserinfo = None
kwargs = {'dayfirst': False, 'default': datetime.datetime(1, 1, 1, 0, 0), 'yearfirst': False}
def parse(timestr, parserinfo=None, **kwargs):
"""
Parse a string in one of the supported formats, using the
``parserinfo`` parameters.
:param timestr:
A string containing a date/time stamp.
:param parserinfo:
A :class:`parserinfo` object containing parameters for the parser.
If ``None``, the default arguments to the :class:`parserinfo`
constructor are used.
The ``**kwargs`` parameter takes the following keyword arguments:
:param default:
The default datetime object, if this is a datetime object and not
``None``, elements specified in ``timestr`` replace elements in the
default object.
:param ignoretz:
If set ``True``, time zones in parsed strings are ignored and a naive
:class:`datetime` object is returned.
:param tzinfos:
Additional time zone names / aliases which may be present in the
string. This argument maps time zone names (and optionally offsets
from those time zones) to time zones. This parameter can be a
dictionary with timezone aliases mapping time zone names to time
zones or a function taking two parameters (``tzname`` and
``tzoffset``) and returning a time zone.
The timezones to which the names are mapped can be an integer
offset from UTC in seconds or a :class:`tzinfo` object.
.. doctest::
:options: +NORMALIZE_WHITESPACE
>>> from dateutil.parser import parse
>>> from dateutil.tz import gettz
>>> tzinfos = {"BRST": -7200, "CST": gettz("America/Chicago")}
>>> parse("2012-01-19 17:21:00 BRST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21, tzinfo=tzoffset(u'BRST', -7200))
>>> parse("2012-01-19 17:21:00 CST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21,
tzinfo=tzfile('/usr/share/zoneinfo/America/Chicago'))
This parameter is ignored if ``ignoretz`` is set.
:param dayfirst:
Whether to interpret the first value in an ambiguous 3-integer date
(e.g. 01/05/09) as the day (``True``) or month (``False``). If
``yearfirst`` is set to ``True``, this distinguishes between YDM and
YMD. If set to ``None``, this value is retrieved from the current
:class:`parserinfo` object (which itself defaults to ``False``).
:param yearfirst:
Whether to interpret the first value in an ambiguous 3-integer date
(e.g. 01/05/09) as the year. If ``True``, the first number is taken to
be the year, otherwise the last number is taken to be the year. If
this is set to ``None``, the value is retrieved from the current
:class:`parserinfo` object (which itself defaults to ``False``).
:param fuzzy:
Whether to allow fuzzy parsing, allowing for string like "Today is
January 1, 2047 at 8:21:00AM".
:param fuzzy_with_tokens:
If ``True``, ``fuzzy`` is automatically set to True, and the parser
will return a tuple where the first element is the parsed
:class:`datetime.datetime` datetimestamp and the second element is
a tuple containing the portions of the string which were ignored:
.. doctest::
>>> from dateutil.parser import parse
>>> parse("Today is January 1, 2047 at 8:21:00AM", fuzzy_with_tokens=True)
(datetime.datetime(2047, 1, 1, 8, 21), (u'Today is ', u' ', u'at '))
:return:
Returns a :class:`datetime.datetime` object or, if the
``fuzzy_with_tokens`` option is ``True``, returns a tuple, the
first element being a :class:`datetime.datetime` object, the second
a tuple containing the fuzzy tokens.
:raises ParserError:
Raised for invalid or unknown string formats, if the provided
:class:`tzinfo` is not in a valid format, or if an invalid date would
be created.
:raises OverflowError:
Raised if the parsed date exceeds the largest valid C integer on
your system.
"""
if parserinfo:
return parser(parserinfo).parse(timestr, **kwargs)
else:
> return DEFAULTPARSER.parse(timestr, **kwargs)
../conda/envs/dask-3.9/lib/python3.9/site-packages/dateutil/parser/_parser.py:1368:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <dateutil.parser._parser.parser object at 0x7f59c25b9550>, timestr = 'now'
default = datetime.datetime(1, 1, 1, 0, 0), ignoretz = False, tzinfos = None
kwargs = {'dayfirst': False, 'yearfirst': False}, res = None, skipped_tokens = None
def parse(self, timestr, default=None,
ignoretz=False, tzinfos=None, **kwargs):
"""
Parse the date/time string into a :class:`datetime.datetime` object.
:param timestr:
Any date/time string using the supported formats.
:param default:
The default datetime object, if this is a datetime object and not
``None``, elements specified in ``timestr`` replace elements in the
default object.
:param ignoretz:
If set ``True``, time zones in parsed strings are ignored and a
naive :class:`datetime.datetime` object is returned.
:param tzinfos:
Additional time zone names / aliases which may be present in the
string. This argument maps time zone names (and optionally offsets
from those time zones) to time zones. This parameter can be a
dictionary with timezone aliases mapping time zone names to time
zones or a function taking two parameters (``tzname`` and
``tzoffset``) and returning a time zone.
The timezones to which the names are mapped can be an integer
offset from UTC in seconds or a :class:`tzinfo` object.
.. doctest::
:options: +NORMALIZE_WHITESPACE
>>> from dateutil.parser import parse
>>> from dateutil.tz import gettz
>>> tzinfos = {"BRST": -7200, "CST": gettz("America/Chicago")}
>>> parse("2012-01-19 17:21:00 BRST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21, tzinfo=tzoffset(u'BRST', -7200))
>>> parse("2012-01-19 17:21:00 CST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21,
tzinfo=tzfile('/usr/share/zoneinfo/America/Chicago'))
This parameter is ignored if ``ignoretz`` is set.
:param \\*\\*kwargs:
Keyword arguments as passed to ``_parse()``.
:return:
Returns a :class:`datetime.datetime` object or, if the
``fuzzy_with_tokens`` option is ``True``, returns a tuple, the
first element being a :class:`datetime.datetime` object, the second
a tuple containing the fuzzy tokens.
:raises ParserError:
Raised for invalid or unknown string format, if the provided
:class:`tzinfo` is not in a valid format, or if an invalid date
would be created.
:raises TypeError:
Raised for non-string or character stream input.
:raises OverflowError:
Raised if the parsed date exceeds the largest valid C integer on
your system.
"""
if default is None:
default = datetime.datetime.now().replace(hour=0, minute=0,
second=0, microsecond=0)
res, skipped_tokens = self._parse(timestr, **kwargs)
if res is None:
> raise ParserError("Unknown string format: %s", timestr)
E dateutil.parser._parser.ParserError: Unknown string format: now
../conda/envs/dask-3.9/lib/python3.9/site-packages/dateutil/parser/_parser.py:643: ParserError
The above exception was the direct cause of the following exception:
tmpdir = local('/tmp/pytest-of-julia/pytest-132/test_roundtrip_fastparquet_df10')
df = x
index
0 1970-01-01 00:00:00.000003+00:00
1 1970-01-01 00:00:00.000002+00:00
2 1970-01-01 00:00:00.000001+00:00
write_kwargs = {}, read_kwargs = {}, engine = 'fastparquet'
@pytest.mark.parametrize(
"df,write_kwargs,read_kwargs",
[
(pd.DataFrame({"x": [3, 2, 1]}), {}, {}),
(pd.DataFrame({"x": ["c", "a", "b"]}), {}, {}),
(pd.DataFrame({"x": ["cc", "a", "bbb"]}), {}, {}),
(pd.DataFrame({"x": [b"a", b"b", b"c"]}), {"object_encoding": "bytes"}, {}),
(
pd.DataFrame({"x": pd.Categorical(["a", "b", "a"])}),
{},
{"categories": ["x"]},
),
(pd.DataFrame({"x": pd.Categorical([1, 2, 1])}), {}, {"categories": ["x"]}),
(pd.DataFrame({"x": list(map(pd.Timestamp, [3000, 2000, 1000]))}), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("M8[ns]"), {}, {}),
pytest.param(
pd.DataFrame({"x": [3, 2, 1]}).astype("M8[ns]"),
{},
{},
),
(pd.DataFrame({"x": [3, 2, 1]}).astype("M8[us]"), {}, {}),
(pd.DataFrame({"x": [3, 2, 1]}).astype("M8[ms]"), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("datetime64[ns]"), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("datetime64[ns, UTC]"), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("datetime64[ns, CET]"), {}, {}),
(pd.DataFrame({"x": [3, 2, 1]}).astype("uint16"), {}, {}),
(pd.DataFrame({"x": [3, 2, 1]}).astype("float32"), {}, {}),
(pd.DataFrame({"x": [3, 1, 2]}, index=[3, 2, 1]), {}, {}),
(pd.DataFrame({"x": [3, 1, 5]}, index=pd.Index([1, 2, 3], name="foo")), {}, {}),
(pd.DataFrame({"x": [1, 2, 3], "y": [3, 2, 1]}), {}, {}),
(pd.DataFrame({"x": [1, 2, 3], "y": [3, 2, 1]}, columns=["y", "x"]), {}, {}),
(pd.DataFrame({"0": [3, 2, 1]}), {}, {}),
(pd.DataFrame({"x": [3, 2, None]}), {}, {}),
(pd.DataFrame({"-": [3.0, 2.0, None]}), {}, {}),
(pd.DataFrame({".": [3.0, 2.0, None]}), {}, {}),
(pd.DataFrame({" ": [3.0, 2.0, None]}), {}, {}),
],
)
def test_roundtrip(tmpdir, df, write_kwargs, read_kwargs, engine):
if "x" in df and df.x.dtype == "M8[ns]" and "arrow" in engine:
pytest.xfail(reason="Parquet pyarrow v1 doesn't support nanosecond precision")
if (
"x" in df
and df.x.dtype == "M8[ns]"
and engine == "fastparquet"
and fastparquet_version <= parse_version("0.6.3")
):
pytest.xfail(reason="fastparquet doesn't support nanosecond precision yet")
if (
PANDAS_GT_130
and read_kwargs.get("categories", None)
and engine == "fastparquet"
and fastparquet_version <= parse_version("0.6.3")
):
pytest.xfail("https://github.com/dask/fastparquet/issues/577")
tmp = str(tmpdir)
if df.index.name is None:
df.index.name = "index"
ddf = dd.from_pandas(df, npartitions=2)
oe = write_kwargs.pop("object_encoding", None)
if oe and engine == "fastparquet":
dd.to_parquet(ddf, tmp, engine=engine, object_encoding=oe, **write_kwargs)
else:
> dd.to_parquet(ddf, tmp, engine=engine, **write_kwargs)
dask/dataframe/io/tests/test_parquet.py:988:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/dataframe/io/parquet/core.py:701: in to_parquet
meta, schema, i_offset = engine.initialize_write(
dask/dataframe/io/parquet/fastparquet.py:1206: in initialize_write
fmd = fastparquet.writer.make_metadata(
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:740: in make_metadata
get_column_metadata(data[column], column))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
column = Series([], Name: x, dtype: datetime64[ns, UTC]), name = 'x'
def get_column_metadata(column, name):
"""Produce pandas column metadata block"""
# from pyarrow.pandas_compat
# https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py
inferred_dtype = infer_dtype(column)
dtype = column.dtype
if str(dtype) == "bool":
# pandas accidentally calls this "boolean"
inferred_dtype = "bool"
if is_categorical_dtype(dtype):
extra_metadata = {
'num_categories': len(column.cat.categories),
'ordered': column.cat.ordered,
}
dtype = column.cat.codes.dtype
elif hasattr(dtype, 'tz'):
try:
stz = str(dtype.tz)
if "UTC" in stz and ":" in stz:
extra_metadata = {'timezone': stz.strip("UTC")}
elif "pytz" not in stz:
pd.Series([pd.to_datetime('now')]).dt.tz_localize(stz)
extra_metadata = {'timezone': str(dtype.tz)}
elif "Offset" in stz:
import pytz
extra_metadata = {'timezone': f"{dtype.tz._minutes // 60:+03}:00"}
else:
raise KeyError
except Exception as e:
> raise ValueError("Time-zone information could not be serialised: "
"%s, please use another" % str(dtype.tz)) from e
E ValueError: Time-zone information could not be serialised: UTC, please use another
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/util.py:322: ValueError
---------------------------------------------- Captured stderr call -----------------------------------------------
FutureWarning: The parsing of 'now' in pd.to_datetime without `utc=True` is deprecated. In a future version, this will match Timestamp('now') and Timestamp.now()
__________________________ test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13] __________________________
data = array(['now'], dtype=object), dayfirst = False, yearfirst = False, utc = False, errors = 'raise'
require_iso8601 = False, allow_object = True, allow_mixed = False
def objects_to_datetime64ns(
data: np.ndarray,
dayfirst,
yearfirst,
utc=False,
errors="raise",
require_iso8601: bool = False,
allow_object: bool = False,
allow_mixed: bool = False,
):
"""
Convert data to array of timestamps.
Parameters
----------
data : np.ndarray[object]
dayfirst : bool
yearfirst : bool
utc : bool, default False
Whether to convert timezone-aware timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
allow_mixed : bool, default False
Interpret integers as timestamps when datetime objects are also present.
Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
np.datetime64[ns] if returned values represent wall times
object if mixed timezones
inferred_tz : tzinfo or None
Raises
------
ValueError : if data cannot be converted to datetimes
"""
assert errors in ["raise", "ignore", "coerce"]
# if str-dtype, convert
data = np.array(data, copy=False, dtype=np.object_)
flags = data.flags
order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
try:
result, tz_parsed = tslib.array_to_datetime(
data.ravel("K"),
errors=errors,
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
allow_mixed=allow_mixed,
)
result = result.reshape(data.shape, order=order)
except ValueError as err:
try:
> values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2211:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E TypeError: Unrecognized value type: <class 'str'>
pandas/_libs/tslibs/conversion.pyx:360: TypeError
During handling of the above exception, another exception occurred:
column = Series([], Name: x, dtype: datetime64[ns, CET]), name = 'x'
def get_column_metadata(column, name):
"""Produce pandas column metadata block"""
# from pyarrow.pandas_compat
# https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py
inferred_dtype = infer_dtype(column)
dtype = column.dtype
if str(dtype) == "bool":
# pandas accidentally calls this "boolean"
inferred_dtype = "bool"
if is_categorical_dtype(dtype):
extra_metadata = {
'num_categories': len(column.cat.categories),
'ordered': column.cat.ordered,
}
dtype = column.cat.codes.dtype
elif hasattr(dtype, 'tz'):
try:
stz = str(dtype.tz)
if "UTC" in stz and ":" in stz:
extra_metadata = {'timezone': stz.strip("UTC")}
elif "pytz" not in stz:
> pd.Series([pd.to_datetime('now')]).dt.tz_localize(stz)
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/util.py:314:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arg = 'now', errors = 'raise', dayfirst = False, yearfirst = False, utc = None, format = None, exact = True
unit = None, infer_datetime_format = False, origin = 'unix', cache = True
def to_datetime(
arg: DatetimeScalarOrArrayConvertible,
errors: str = "raise",
dayfirst: bool = False,
yearfirst: bool = False,
utc: bool | None = None,
format: str | None = None,
exact: bool = True,
unit: str | None = None,
infer_datetime_format: bool = False,
origin="unix",
cache: bool = True,
) -> DatetimeIndex | Series | DatetimeScalar | NaTType | None:
"""
Convert argument to datetime.
This function converts a scalar, array-like, :class:`Series` or
:class:`DataFrame`/dict-like to a pandas datetime object.
Parameters
----------
arg : int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
The object to convert to a datetime. If a :class:`DataFrame` is provided, the
method expects minimally the following columns: :const:`"year"`,
:const:`"month"`, :const:`"day"`.
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
- If :const:`'raise'`, then invalid parsing will raise an exception.
- If :const:`'coerce'`, then invalid parsing will be set as :const:`NaT`.
- If :const:`'ignore'`, then invalid parsing will return the input.
dayfirst : bool, default False
Specify a date parse order if `arg` is str or is list-like.
If :const:`True`, parses dates with the day first, e.g. :const:`"10/11/12"`
is parsed as :const:`2012-11-10`.
.. warning::
``dayfirst=True`` is not strict, but will prefer to parse
with day first. If a delimited date string cannot be parsed in
accordance with the given `dayfirst` option, e.g.
``to_datetime(['31-12-2021'])``, then a warning will be shown.
yearfirst : bool, default False
Specify a date parse order if `arg` is str or is list-like.
- If :const:`True` parses dates with the year first, e.g.
:const:`"10/11/12"` is parsed as :const:`2010-11-12`.
- If both `dayfirst` and `yearfirst` are :const:`True`, `yearfirst` is
preceded (same as :mod:`dateutil`).
.. warning::
``yearfirst=True`` is not strict, but will prefer to parse
with year first.
utc : bool, default None
Control timezone-related parsing, localization and conversion.
- If :const:`True`, the function *always* returns a timezone-aware
UTC-localized :class:`Timestamp`, :class:`Series` or
:class:`DatetimeIndex`. To do this, timezone-naive inputs are
*localized* as UTC, while timezone-aware inputs are *converted* to UTC.
- If :const:`False` (default), inputs will not be coerced to UTC.
Timezone-naive inputs will remain naive, while timezone-aware ones
will keep their time offsets. Limitations exist for mixed
offsets (typically, daylight savings), see :ref:`Examples
<to_datetime_tz_examples>` section for details.
See also: pandas general documentation about `timezone conversion and
localization
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#time-zone-handling>`_.
format : str, default None
The strftime to parse time, e.g. :const:`"%d/%m/%Y"`. Note that
:const:`"%f"` will parse all the way up to nanoseconds. See
`strftime documentation
<https://docs.python.org/3/library/datetime.html
#strftime-and-strptime-behavior>`_ for more information on choices.
exact : bool, default True
Control how `format` is used:
- If :const:`True`, require an exact `format` match.
- If :const:`False`, allow the `format` to match anywhere in the target
string.
unit : str, default 'ns'
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an
integer or float number. This will be based off the origin.
Example, with ``unit='ms'`` and ``origin='unix'`` (the default), this
would calculate the number of milliseconds to the unix epoch start.
infer_datetime_format : bool, default False
If :const:`True` and no `format` is given, attempt to infer the format
of the datetime strings based on the first non-NaN element,
and if it can be inferred, switch to a faster method of parsing them.
In some cases this can increase the parsing speed by ~5-10x.
origin : scalar, default 'unix'
Define the reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
- If :const:`'unix'` (or POSIX) time; origin is set to 1970-01-01.
- If :const:`'julian'`, unit must be :const:`'D'`, and origin is set to
beginning of Julian Calendar. Julian day number :const:`0` is assigned
to the day starting at noon on January 1, 4713 BC.
- If Timestamp convertible, origin is set to Timestamp identified by
origin.
cache : bool, default True
If :const:`True`, use a cache of unique, converted dates to apply the
datetime conversion. May produce significant speed-up when parsing
duplicate date strings, especially ones with timezone offsets. The cache
is only used when there are at least 50 values. The presence of
out-of-bounds values will render the cache unusable and may slow down
parsing.
.. versionchanged:: 0.25.0
changed default value from :const:`False` to :const:`True`.
Returns
-------
datetime
If parsing succeeded.
Return type depends on input (types in parenthesis correspond to
fallback in case of unsuccessful timezone or out-of-range timestamp
parsing):
- scalar: :class:`Timestamp` (or :class:`datetime.datetime`)
- array-like: :class:`DatetimeIndex` (or :class:`Series` with
:class:`object` dtype containing :class:`datetime.datetime`)
- Series: :class:`Series` of :class:`datetime64` dtype (or
:class:`Series` of :class:`object` dtype containing
:class:`datetime.datetime`)
- DataFrame: :class:`Series` of :class:`datetime64` dtype (or
:class:`Series` of :class:`object` dtype containing
:class:`datetime.datetime`)
Raises
------
ParserError
When parsing a date from string fails.
ValueError
When another datetime conversion error happens. For example when one
of 'year', 'month', day' columns is missing in a :class:`DataFrame`, or
when a Timezone-aware :class:`datetime.datetime` is found in an array-like
of mixed time offsets, and ``utc=False``.
See Also
--------
DataFrame.astype : Cast argument to a specified dtype.
to_timedelta : Convert argument to timedelta.
convert_dtypes : Convert dtypes.
Notes
-----
Many input types are supported, and lead to different output types:
- **scalars** can be int, float, str, datetime object (from stdlib :mod:`datetime`
module or :mod:`numpy`). They are converted to :class:`Timestamp` when
possible, otherwise they are converted to :class:`datetime.datetime`.
None/NaN/null scalars are converted to :const:`NaT`.
- **array-like** can contain int, float, str, datetime objects. They are
converted to :class:`DatetimeIndex` when possible, otherwise they are
converted to :class:`Index` with :class:`object` dtype, containing
:class:`datetime.datetime`. None/NaN/null entries are converted to
:const:`NaT` in both cases.
- **Series** are converted to :class:`Series` with :class:`datetime64`
dtype when possible, otherwise they are converted to :class:`Series` with
:class:`object` dtype, containing :class:`datetime.datetime`. None/NaN/null
entries are converted to :const:`NaT` in both cases.
- **DataFrame/dict-like** are converted to :class:`Series` with
:class:`datetime64` dtype. For each row a datetime is created from assembling
the various dataframe columns. Column keys can be common abbreviations
like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or
plurals of the same.
The following causes are responsible for :class:`datetime.datetime` objects
being returned (possibly inside an :class:`Index` or a :class:`Series` with
:class:`object` dtype) instead of a proper pandas designated type
(:class:`Timestamp`, :class:`DatetimeIndex` or :class:`Series`
with :class:`datetime64` dtype):
- when any input element is before :const:`Timestamp.min` or after
:const:`Timestamp.max`, see `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_.
- when ``utc=False`` (default) and the input is an array-like or
:class:`Series` containing mixed naive/aware datetime, or aware with mixed
time offsets. Note that this happens in the (quite frequent) situation when
the timezone has a daylight savings policy. In that case you may wish to
use ``utc=True``.
Examples
--------
**Handling various input formats**
Assembling a datetime from multiple columns of a :class:`DataFrame`. The keys
can be common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns']) or plurals of the same
>>> df = pd.DataFrame({'year': [2015, 2016],
... 'month': [2, 3],
... 'day': [4, 5]})
>>> pd.to_datetime(df)
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
Passing ``infer_datetime_format=True`` can often-times speedup a parsing
if its not an ISO8601 format exactly, but in a regular format.
>>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)
>>> s.head()
0 3/11/2000
1 3/12/2000
2 3/13/2000
3 3/11/2000
4 3/12/2000
dtype: object
>>> %timeit pd.to_datetime(s, infer_datetime_format=True) # doctest: +SKIP
100 loops, best of 3: 10.4 ms per loop
>>> %timeit pd.to_datetime(s, infer_datetime_format=False) # doctest: +SKIP
1 loop, best of 3: 471 ms per loop
Using a unix epoch time
>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')
.. warning:: For float arg, precision rounding might happen. To prevent
unexpected behavior use a fixed-width exact type.
Using a non-unix epoch origin
>>> pd.to_datetime([1, 2, 3], unit='D',
... origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'],
dtype='datetime64[ns]', freq=None)
**Non-convertible date/times**
If a date does not meet the `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_, passing ``errors='ignore'``
will return the original input instead of raising any exception.
Passing ``errors='coerce'`` will force an out-of-bounds date to :const:`NaT`,
in addition to forcing non-dates (or non-parseable dates) to :const:`NaT`.
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT
.. _to_datetime_tz_examples:
**Timezones and time offsets**
The default behaviour (``utc=False``) is as follows:
- Timezone-naive inputs are converted to timezone-naive :class:`DatetimeIndex`:
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00:15'])
DatetimeIndex(['2018-10-26 12:00:00', '2018-10-26 13:00:15'],
dtype='datetime64[ns]', freq=None)
- Timezone-aware inputs *with constant time offset* are converted to
timezone-aware :class:`DatetimeIndex`:
>>> pd.to_datetime(['2018-10-26 12:00 -0500', '2018-10-26 13:00 -0500'])
DatetimeIndex(['2018-10-26 12:00:00-05:00', '2018-10-26 13:00:00-05:00'],
dtype='datetime64[ns, pytz.FixedOffset(-300)]', freq=None)
- However, timezone-aware inputs *with mixed time offsets* (for example
issued from a timezone with daylight savings, such as Europe/Paris)
are **not successfully converted** to a :class:`DatetimeIndex`. Instead a
simple :class:`Index` containing :class:`datetime.datetime` objects is
returned:
>>> pd.to_datetime(['2020-10-25 02:00 +0200', '2020-10-25 04:00 +0100'])
Index([2020-10-25 02:00:00+02:00, 2020-10-25 04:00:00+01:00],
dtype='object')
- A mix of timezone-aware and timezone-naive inputs is converted to
a timezone-aware :class:`DatetimeIndex` if the offsets of the timezone-aware
are constant:
>>> from datetime import datetime
>>> pd.to_datetime(["2020-01-01 01:00 -01:00", datetime(2020, 1, 1, 3, 0)])
DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'],
dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)
- Finally, mixing timezone-aware strings and :class:`datetime.datetime` always
raises an error, even if the elements all have the same time offset.
>>> from datetime import datetime, timezone, timedelta
>>> d = datetime(2020, 1, 1, 18, tzinfo=timezone(-timedelta(hours=1)))
>>> pd.to_datetime(["2020-01-01 17:00 -0100", d])
Traceback (most recent call last):
...
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64
unless utc=True
|
Setting ``utc=True`` solves most of the above issues:
- Timezone-naive inputs are *localized* as UTC
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00'], utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 13:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
- Timezone-aware inputs are *converted* to UTC (the output represents the
exact same datetime, but viewed from the UTC time offset `+00:00`).
>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'],
... utc=True)
DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
- Inputs can contain both naive and aware, string or datetime, the above
rules still apply
>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 12:00 -0530',
... datetime(2020, 1, 1, 18),
... datetime(2020, 1, 1, 18,
... tzinfo=timezone(-timedelta(hours=1)))],
... utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 17:30:00+00:00',
'2020-01-01 18:00:00+00:00', '2020-01-01 19:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
"""
if arg is None:
return None
if origin != "unix":
arg = _adjust_to_origin(arg, origin, unit)
tz = "utc" if utc else None
convert_listlike = partial(
_convert_listlike_datetimes,
tz=tz,
unit=unit,
dayfirst=dayfirst,
yearfirst=yearfirst,
errors=errors,
exact=exact,
infer_datetime_format=infer_datetime_format,
)
result: Timestamp | NaTType | Series | Index
if isinstance(arg, Timestamp):
result = arg
if tz is not None:
if arg.tz is not None:
result = arg.tz_convert(tz)
else:
result = arg.tz_localize(tz)
elif isinstance(arg, ABCSeries):
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
if not cache_array.empty:
result = arg.map(cache_array)
else:
values = convert_listlike(arg._values, format)
result = arg._constructor(values, index=arg.index, name=arg.name)
elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):
result = _assemble_from_unit_mappings(arg, errors, tz)
elif isinstance(arg, Index):
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
if not cache_array.empty:
result = _convert_and_box_cache(arg, cache_array, name=arg.name)
else:
result = convert_listlike(arg, format, name=arg.name)
elif is_list_like(arg):
try:
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
except OutOfBoundsDatetime:
# caching attempts to create a DatetimeIndex, which may raise
# an OOB. If that's the desired behavior, then just reraise...
if errors == "raise":
raise
# ... otherwise, continue without the cache.
from pandas import Series
cache_array = Series([], dtype=object) # just an empty array
if not cache_array.empty:
result = _convert_and_box_cache(arg, cache_array)
else:
result = convert_listlike(arg, format)
else:
> result = convert_listlike(np.array([arg]), format)[0]
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:1078:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arg = array(['now'], dtype=object), format = None, name = None, tz = None, unit = None, errors = 'raise'
infer_datetime_format = False, dayfirst = False, yearfirst = False, exact = True
def _convert_listlike_datetimes(
arg,
format: str | None,
name: Hashable = None,
tz: Timezone | None = None,
unit: str | None = None,
errors: str = "raise",
infer_datetime_format: bool = False,
dayfirst: bool | None = None,
yearfirst: bool | None = None,
exact: bool = True,
):
"""
Helper function for to_datetime. Performs the conversions of 1D listlike
of dates
Parameters
----------
arg : list, tuple, ndarray, Series, Index
date to be parsed
name : object
None or string for the Index name
tz : object
None or 'utc'
unit : str
None or string of the frequency of the passed data
errors : str
error handing behaviors from to_datetime, 'raise', 'coerce', 'ignore'
infer_datetime_format : bool, default False
inferring format behavior from to_datetime
dayfirst : bool
dayfirst parsing behavior from to_datetime
yearfirst : bool
yearfirst parsing behavior from to_datetime
exact : bool, default True
exact format matching behavior from to_datetime
Returns
-------
Index-like of parsed dates
"""
if isinstance(arg, (list, tuple)):
arg = np.array(arg, dtype="O")
arg_dtype = getattr(arg, "dtype", None)
# these are shortcutable
if is_datetime64tz_dtype(arg_dtype):
if not isinstance(arg, (DatetimeArray, DatetimeIndex)):
return DatetimeIndex(arg, tz=tz, name=name)
if tz == "utc":
arg = arg.tz_convert(None).tz_localize(tz)
return arg
elif is_datetime64_ns_dtype(arg_dtype):
if not isinstance(arg, (DatetimeArray, DatetimeIndex)):
try:
return DatetimeIndex(arg, tz=tz, name=name)
except ValueError:
pass
elif tz:
# DatetimeArray, DatetimeIndex
return arg.tz_localize(tz)
return arg
elif unit is not None:
if format is not None:
raise ValueError("cannot specify both format and unit")
return _to_datetime_with_unit(arg, unit, name, tz, errors)
elif getattr(arg, "ndim", 1) > 1:
raise TypeError(
"arg must be a string, datetime, list, tuple, 1-d array, or Series"
)
# warn if passing timedelta64, raise for PeriodDtype
# NB: this must come after unit transformation
orig_arg = arg
try:
arg, _ = maybe_convert_dtype(arg, copy=False)
except TypeError:
if errors == "coerce":
npvalues = np.array(["NaT"], dtype="datetime64[ns]").repeat(len(arg))
return DatetimeIndex(npvalues, name=name)
elif errors == "ignore":
idx = Index(arg, name=name)
return idx
raise
arg = ensure_object(arg)
require_iso8601 = False
if infer_datetime_format and format is None:
format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
if format is not None:
# There is a special fast-path for iso8601 formatted
# datetime strings, so in those cases don't use the inferred
# format because this path makes process slower in this
# special case
format_is_iso8601 = format_is_iso(format)
if format_is_iso8601:
require_iso8601 = not infer_datetime_format
format = None
if format is not None:
res = _to_datetime_with_format(
arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
)
if res is not None:
return res
assert format is None or infer_datetime_format
utc = tz == "utc"
> result, tz_parsed = objects_to_datetime64ns(
arg,
dayfirst=dayfirst,
yearfirst=yearfirst,
utc=utc,
errors=errors,
require_iso8601=require_iso8601,
allow_object=True,
)
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/tools/datetimes.py:402:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = array(['now'], dtype=object), dayfirst = False, yearfirst = False, utc = False, errors = 'raise'
require_iso8601 = False, allow_object = True, allow_mixed = False
def objects_to_datetime64ns(
data: np.ndarray,
dayfirst,
yearfirst,
utc=False,
errors="raise",
require_iso8601: bool = False,
allow_object: bool = False,
allow_mixed: bool = False,
):
"""
Convert data to array of timestamps.
Parameters
----------
data : np.ndarray[object]
dayfirst : bool
yearfirst : bool
utc : bool, default False
Whether to convert timezone-aware timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
allow_mixed : bool, default False
Interpret integers as timestamps when datetime objects are also present.
Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
np.datetime64[ns] if returned values represent wall times
object if mixed timezones
inferred_tz : tzinfo or None
Raises
------
ValueError : if data cannot be converted to datetimes
"""
assert errors in ["raise", "ignore", "coerce"]
# if str-dtype, convert
data = np.array(data, copy=False, dtype=np.object_)
flags = data.flags
order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
try:
result, tz_parsed = tslib.array_to_datetime(
data.ravel("K"),
errors=errors,
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
allow_mixed=allow_mixed,
)
result = result.reshape(data.shape, order=order)
except ValueError as err:
try:
values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
# If tzaware, these values represent unix timestamps, so we
# return them as i8 to distinguish from wall times
values = values.reshape(data.shape, order=order)
return values.view("i8"), tz_parsed
except (ValueError, TypeError):
> raise err
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2217:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = array(['now'], dtype=object), dayfirst = False, yearfirst = False, utc = False, errors = 'raise'
require_iso8601 = False, allow_object = True, allow_mixed = False
def objects_to_datetime64ns(
data: np.ndarray,
dayfirst,
yearfirst,
utc=False,
errors="raise",
require_iso8601: bool = False,
allow_object: bool = False,
allow_mixed: bool = False,
):
"""
Convert data to array of timestamps.
Parameters
----------
data : np.ndarray[object]
dayfirst : bool
yearfirst : bool
utc : bool, default False
Whether to convert timezone-aware timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
allow_mixed : bool, default False
Interpret integers as timestamps when datetime objects are also present.
Returns
-------
result : ndarray
np.int64 dtype if returned values represent UTC timestamps
np.datetime64[ns] if returned values represent wall times
object if mixed timezones
inferred_tz : tzinfo or None
Raises
------
ValueError : if data cannot be converted to datetimes
"""
assert errors in ["raise", "ignore", "coerce"]
# if str-dtype, convert
data = np.array(data, copy=False, dtype=np.object_)
flags = data.flags
order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
try:
> result, tz_parsed = tslib.array_to_datetime(
data.ravel("K"),
errors=errors,
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
allow_mixed=allow_mixed,
)
../conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:2199:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:381:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:613:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:751:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslib.pyx:742:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
pandas/_libs/tslibs/parsing.pyx:281:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
timestr = 'now', parserinfo = None
kwargs = {'dayfirst': False, 'default': datetime.datetime(1, 1, 1, 0, 0), 'yearfirst': False}
def parse(timestr, parserinfo=None, **kwargs):
"""
Parse a string in one of the supported formats, using the
``parserinfo`` parameters.
:param timestr:
A string containing a date/time stamp.
:param parserinfo:
A :class:`parserinfo` object containing parameters for the parser.
If ``None``, the default arguments to the :class:`parserinfo`
constructor are used.
The ``**kwargs`` parameter takes the following keyword arguments:
:param default:
The default datetime object, if this is a datetime object and not
``None``, elements specified in ``timestr`` replace elements in the
default object.
:param ignoretz:
If set ``True``, time zones in parsed strings are ignored and a naive
:class:`datetime` object is returned.
:param tzinfos:
Additional time zone names / aliases which may be present in the
string. This argument maps time zone names (and optionally offsets
from those time zones) to time zones. This parameter can be a
dictionary with timezone aliases mapping time zone names to time
zones or a function taking two parameters (``tzname`` and
``tzoffset``) and returning a time zone.
The timezones to which the names are mapped can be an integer
offset from UTC in seconds or a :class:`tzinfo` object.
.. doctest::
:options: +NORMALIZE_WHITESPACE
>>> from dateutil.parser import parse
>>> from dateutil.tz import gettz
>>> tzinfos = {"BRST": -7200, "CST": gettz("America/Chicago")}
>>> parse("2012-01-19 17:21:00 BRST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21, tzinfo=tzoffset(u'BRST', -7200))
>>> parse("2012-01-19 17:21:00 CST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21,
tzinfo=tzfile('/usr/share/zoneinfo/America/Chicago'))
This parameter is ignored if ``ignoretz`` is set.
:param dayfirst:
Whether to interpret the first value in an ambiguous 3-integer date
(e.g. 01/05/09) as the day (``True``) or month (``False``). If
``yearfirst`` is set to ``True``, this distinguishes between YDM and
YMD. If set to ``None``, this value is retrieved from the current
:class:`parserinfo` object (which itself defaults to ``False``).
:param yearfirst:
Whether to interpret the first value in an ambiguous 3-integer date
(e.g. 01/05/09) as the year. If ``True``, the first number is taken to
be the year, otherwise the last number is taken to be the year. If
this is set to ``None``, the value is retrieved from the current
:class:`parserinfo` object (which itself defaults to ``False``).
:param fuzzy:
Whether to allow fuzzy parsing, allowing for string like "Today is
January 1, 2047 at 8:21:00AM".
:param fuzzy_with_tokens:
If ``True``, ``fuzzy`` is automatically set to True, and the parser
will return a tuple where the first element is the parsed
:class:`datetime.datetime` datetimestamp and the second element is
a tuple containing the portions of the string which were ignored:
.. doctest::
>>> from dateutil.parser import parse
>>> parse("Today is January 1, 2047 at 8:21:00AM", fuzzy_with_tokens=True)
(datetime.datetime(2047, 1, 1, 8, 21), (u'Today is ', u' ', u'at '))
:return:
Returns a :class:`datetime.datetime` object or, if the
``fuzzy_with_tokens`` option is ``True``, returns a tuple, the
first element being a :class:`datetime.datetime` object, the second
a tuple containing the fuzzy tokens.
:raises ParserError:
Raised for invalid or unknown string formats, if the provided
:class:`tzinfo` is not in a valid format, or if an invalid date would
be created.
:raises OverflowError:
Raised if the parsed date exceeds the largest valid C integer on
your system.
"""
if parserinfo:
return parser(parserinfo).parse(timestr, **kwargs)
else:
> return DEFAULTPARSER.parse(timestr, **kwargs)
../conda/envs/dask-3.9/lib/python3.9/site-packages/dateutil/parser/_parser.py:1368:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <dateutil.parser._parser.parser object at 0x7f59c25b9550>, timestr = 'now'
default = datetime.datetime(1, 1, 1, 0, 0), ignoretz = False, tzinfos = None
kwargs = {'dayfirst': False, 'yearfirst': False}, res = None, skipped_tokens = None
def parse(self, timestr, default=None,
ignoretz=False, tzinfos=None, **kwargs):
"""
Parse the date/time string into a :class:`datetime.datetime` object.
:param timestr:
Any date/time string using the supported formats.
:param default:
The default datetime object, if this is a datetime object and not
``None``, elements specified in ``timestr`` replace elements in the
default object.
:param ignoretz:
If set ``True``, time zones in parsed strings are ignored and a
naive :class:`datetime.datetime` object is returned.
:param tzinfos:
Additional time zone names / aliases which may be present in the
string. This argument maps time zone names (and optionally offsets
from those time zones) to time zones. This parameter can be a
dictionary with timezone aliases mapping time zone names to time
zones or a function taking two parameters (``tzname`` and
``tzoffset``) and returning a time zone.
The timezones to which the names are mapped can be an integer
offset from UTC in seconds or a :class:`tzinfo` object.
.. doctest::
:options: +NORMALIZE_WHITESPACE
>>> from dateutil.parser import parse
>>> from dateutil.tz import gettz
>>> tzinfos = {"BRST": -7200, "CST": gettz("America/Chicago")}
>>> parse("2012-01-19 17:21:00 BRST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21, tzinfo=tzoffset(u'BRST', -7200))
>>> parse("2012-01-19 17:21:00 CST", tzinfos=tzinfos)
datetime.datetime(2012, 1, 19, 17, 21,
tzinfo=tzfile('/usr/share/zoneinfo/America/Chicago'))
This parameter is ignored if ``ignoretz`` is set.
:param \\*\\*kwargs:
Keyword arguments as passed to ``_parse()``.
:return:
Returns a :class:`datetime.datetime` object or, if the
``fuzzy_with_tokens`` option is ``True``, returns a tuple, the
first element being a :class:`datetime.datetime` object, the second
a tuple containing the fuzzy tokens.
:raises ParserError:
Raised for invalid or unknown string format, if the provided
:class:`tzinfo` is not in a valid format, or if an invalid date
would be created.
:raises TypeError:
Raised for non-string or character stream input.
:raises OverflowError:
Raised if the parsed date exceeds the largest valid C integer on
your system.
"""
if default is None:
default = datetime.datetime.now().replace(hour=0, minute=0,
second=0, microsecond=0)
res, skipped_tokens = self._parse(timestr, **kwargs)
if res is None:
> raise ParserError("Unknown string format: %s", timestr)
E dateutil.parser._parser.ParserError: Unknown string format: now
../conda/envs/dask-3.9/lib/python3.9/site-packages/dateutil/parser/_parser.py:643: ParserError
The above exception was the direct cause of the following exception:
tmpdir = local('/tmp/pytest-of-julia/pytest-132/test_roundtrip_fastparquet_df11')
df = x
index
0 1970-01-01 01:00:00.000003+01:00
1 1970-01-01 01:00:00.000002+01:00
2 1970-01-01 01:00:00.000001+01:00
write_kwargs = {}, read_kwargs = {}, engine = 'fastparquet'
@pytest.mark.parametrize(
"df,write_kwargs,read_kwargs",
[
(pd.DataFrame({"x": [3, 2, 1]}), {}, {}),
(pd.DataFrame({"x": ["c", "a", "b"]}), {}, {}),
(pd.DataFrame({"x": ["cc", "a", "bbb"]}), {}, {}),
(pd.DataFrame({"x": [b"a", b"b", b"c"]}), {"object_encoding": "bytes"}, {}),
(
pd.DataFrame({"x": pd.Categorical(["a", "b", "a"])}),
{},
{"categories": ["x"]},
),
(pd.DataFrame({"x": pd.Categorical([1, 2, 1])}), {}, {"categories": ["x"]}),
(pd.DataFrame({"x": list(map(pd.Timestamp, [3000, 2000, 1000]))}), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("M8[ns]"), {}, {}),
pytest.param(
pd.DataFrame({"x": [3, 2, 1]}).astype("M8[ns]"),
{},
{},
),
(pd.DataFrame({"x": [3, 2, 1]}).astype("M8[us]"), {}, {}),
(pd.DataFrame({"x": [3, 2, 1]}).astype("M8[ms]"), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("datetime64[ns]"), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("datetime64[ns, UTC]"), {}, {}),
(pd.DataFrame({"x": [3000, 2000, 1000]}).astype("datetime64[ns, CET]"), {}, {}),
(pd.DataFrame({"x": [3, 2, 1]}).astype("uint16"), {}, {}),
(pd.DataFrame({"x": [3, 2, 1]}).astype("float32"), {}, {}),
(pd.DataFrame({"x": [3, 1, 2]}, index=[3, 2, 1]), {}, {}),
(pd.DataFrame({"x": [3, 1, 5]}, index=pd.Index([1, 2, 3], name="foo")), {}, {}),
(pd.DataFrame({"x": [1, 2, 3], "y": [3, 2, 1]}), {}, {}),
(pd.DataFrame({"x": [1, 2, 3], "y": [3, 2, 1]}, columns=["y", "x"]), {}, {}),
(pd.DataFrame({"0": [3, 2, 1]}), {}, {}),
(pd.DataFrame({"x": [3, 2, None]}), {}, {}),
(pd.DataFrame({"-": [3.0, 2.0, None]}), {}, {}),
(pd.DataFrame({".": [3.0, 2.0, None]}), {}, {}),
(pd.DataFrame({" ": [3.0, 2.0, None]}), {}, {}),
],
)
def test_roundtrip(tmpdir, df, write_kwargs, read_kwargs, engine):
if "x" in df and df.x.dtype == "M8[ns]" and "arrow" in engine:
pytest.xfail(reason="Parquet pyarrow v1 doesn't support nanosecond precision")
if (
"x" in df
and df.x.dtype == "M8[ns]"
and engine == "fastparquet"
and fastparquet_version <= parse_version("0.6.3")
):
pytest.xfail(reason="fastparquet doesn't support nanosecond precision yet")
if (
PANDAS_GT_130
and read_kwargs.get("categories", None)
and engine == "fastparquet"
and fastparquet_version <= parse_version("0.6.3")
):
pytest.xfail("https://github.com/dask/fastparquet/issues/577")
tmp = str(tmpdir)
if df.index.name is None:
df.index.name = "index"
ddf = dd.from_pandas(df, npartitions=2)
oe = write_kwargs.pop("object_encoding", None)
if oe and engine == "fastparquet":
dd.to_parquet(ddf, tmp, engine=engine, object_encoding=oe, **write_kwargs)
else:
> dd.to_parquet(ddf, tmp, engine=engine, **write_kwargs)
dask/dataframe/io/tests/test_parquet.py:988:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/dataframe/io/parquet/core.py:701: in to_parquet
meta, schema, i_offset = engine.initialize_write(
dask/dataframe/io/parquet/fastparquet.py:1206: in initialize_write
fmd = fastparquet.writer.make_metadata(
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:740: in make_metadata
get_column_metadata(data[column], column))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
column = Series([], Name: x, dtype: datetime64[ns, CET]), name = 'x'
def get_column_metadata(column, name):
"""Produce pandas column metadata block"""
# from pyarrow.pandas_compat
# https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py
inferred_dtype = infer_dtype(column)
dtype = column.dtype
if str(dtype) == "bool":
# pandas accidentally calls this "boolean"
inferred_dtype = "bool"
if is_categorical_dtype(dtype):
extra_metadata = {
'num_categories': len(column.cat.categories),
'ordered': column.cat.ordered,
}
dtype = column.cat.codes.dtype
elif hasattr(dtype, 'tz'):
try:
stz = str(dtype.tz)
if "UTC" in stz and ":" in stz:
extra_metadata = {'timezone': stz.strip("UTC")}
elif "pytz" not in stz:
pd.Series([pd.to_datetime('now')]).dt.tz_localize(stz)
extra_metadata = {'timezone': str(dtype.tz)}
elif "Offset" in stz:
import pytz
extra_metadata = {'timezone': f"{dtype.tz._minutes // 60:+03}:00"}
else:
raise KeyError
except Exception as e:
> raise ValueError("Time-zone information could not be serialised: "
"%s, please use another" % str(dtype.tz)) from e
E ValueError: Time-zone information could not be serialised: CET, please use another
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/util.py:322: ValueError
---------------------------------------------- Captured stderr call -----------------------------------------------
FutureWarning: The parsing of 'now' in pd.to_datetime without `utc=True` is deprecated. In a future version, this will match Timestamp('now') and Timestamp.now()
________________________________________________ test_timestamp96 _________________________________________________
data = 0 1643143105000000000
Name: a, dtype: object
se = <class 'fastparquet.parquet_thrift.parquet.ttypes.SchemaElement'>
converted_type: 0
field_id: None
logicalType: None
name: a
num_children: None
precision: None
repetition_type: 1
scale: None
type: 6
type_length: None
def convert(data, se):
"""Convert data according to the schema encoding"""
dtype = data.dtype
type = se.type
converted_type = se.converted_type
if dtype.name in typemap:
if type in revmap:
out = data.values.astype(revmap[type], copy=False)
elif type == parquet_thrift.Type.BOOLEAN:
# TODO: with our own bitpack writer, no need to copy for
# the padding
padded = np.lib.pad(data.values, (0, 8 - (len(data) % 8)),
'constant', constant_values=(0, 0))
out = np.packbits(padded.reshape(-1, 8)[:, ::-1].ravel())
elif dtype.name in typemap:
out = data.values
elif "S" in str(dtype)[:2] or "U" in str(dtype)[:2]:
out = data.values
elif dtype == "O":
# TODO: nullable types
try:
if converted_type == parquet_thrift.ConvertedType.UTF8:
# getattr for new pandas StringArray
# TODO: to bytes in one step
> out = array_encode_utf8(data)
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:245:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E TypeError: bad argument type for built-in operation
fastparquet/speedups.pyx:50: TypeError
During handling of the above exception, another exception occurred:
tmpdir = local('/tmp/pytest-of-julia/pytest-132/test_timestamp960')
@FASTPARQUET_MARK
def test_timestamp96(tmpdir):
fn = str(tmpdir)
df = pd.DataFrame({"a": ["now"]}, dtype="M8[ns]")
ddf = dd.from_pandas(df, 1)
> ddf.to_parquet(fn, write_index=False, times="int96")
dask/dataframe/io/tests/test_parquet.py:1644:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask/dataframe/core.py:4736: in to_parquet
return to_parquet(self, path, *args, **kwargs)
dask/dataframe/io/parquet/core.py:784: in to_parquet
return compute_as_if_collection(
dask/base.py:315: in compute_as_if_collection
return schedule(dsk2, keys, **kwargs)
dask/threaded.py:79: in get
results = get_async(
dask/local.py:507: in get_async
raise_exception(exc, tb)
dask/local.py:315: in reraise
raise exc
dask/local.py:220: in execute_task
result = _execute_task(task, data)
dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
dask/utils.py:40: in apply
return func(*args, **kwargs)
dask/dataframe/io/parquet/fastparquet.py:1279: in write_partition
rg = make_part_file(
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:669: in make_part_file
rg = make_row_group(f, data, schema, compression=compression,
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:655: in make_row_group
chunk = write_column(f, data[column.name], column,
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:533: in write_column
repetition_data, definition_data, encode[encoding](data, selement), 8 * b'\x00'
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:340: in encode_plain
out = convert(data, se)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = 0 1643143105000000000
Name: a, dtype: object
se = <class 'fastparquet.parquet_thrift.parquet.ttypes.SchemaElement'>
converted_type: 0
field_id: None
logicalType: None
name: a
num_children: None
precision: None
repetition_type: 1
scale: None
type: 6
type_length: None
def convert(data, se):
"""Convert data according to the schema encoding"""
dtype = data.dtype
type = se.type
converted_type = se.converted_type
if dtype.name in typemap:
if type in revmap:
out = data.values.astype(revmap[type], copy=False)
elif type == parquet_thrift.Type.BOOLEAN:
# TODO: with our own bitpack writer, no need to copy for
# the padding
padded = np.lib.pad(data.values, (0, 8 - (len(data) % 8)),
'constant', constant_values=(0, 0))
out = np.packbits(padded.reshape(-1, 8)[:, ::-1].ravel())
elif dtype.name in typemap:
out = data.values
elif "S" in str(dtype)[:2] or "U" in str(dtype)[:2]:
out = data.values
elif dtype == "O":
# TODO: nullable types
try:
if converted_type == parquet_thrift.ConvertedType.UTF8:
# getattr for new pandas StringArray
# TODO: to bytes in one step
out = array_encode_utf8(data)
elif converted_type == parquet_thrift.ConvertedType.DECIMAL:
out = data.values.astype(np.float64, copy=False)
elif converted_type is None:
if type in revmap:
out = data.values.astype(revmap[type], copy=False)
elif type == parquet_thrift.Type.BOOLEAN:
# TODO: with our own bitpack writer, no need to copy for
# the padding
padded = np.lib.pad(data.values, (0, 8 - (len(data) % 8)),
'constant', constant_values=(0, 0))
out = np.packbits(padded.reshape(-1, 8)[:, ::-1].ravel())
else:
out = data.values
elif converted_type == parquet_thrift.ConvertedType.JSON:
# TODO: avoid list, use better JSON
out = np.array([json.dumps(x).encode('utf8') for x in data],
dtype="O")
elif converted_type == parquet_thrift.ConvertedType.BSON:
out = data.map(tobson).values
if type == parquet_thrift.Type.FIXED_LEN_BYTE_ARRAY:
out = out.astype('S%i' % se.type_length)
except Exception as e:
ct = parquet_thrift.ConvertedType._VALUES_TO_NAMES[
converted_type] if converted_type is not None else None
> raise ValueError('Error converting column "%s" to bytes using '
'encoding %s. Original error: '
E ValueError: Error converting column "a" to bytes using encoding UTF8. Original error: bad argument type for built-in operation
../conda/envs/dask-3.9/lib/python3.9/site-packages/fastparquet/writer.py:270: ValueError
---------------------------------------------- Captured stderr call -----------------------------------------------
FutureWarning: The parsing of 'now' in pd.to_datetime without `utc=True` is deprecated. In a future version, this will match Timestamp('now') and Timestamp.now()
================================================ warnings summary =================================================
dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12]
dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13]
dask/dataframe/io/tests/test_parquet.py::test_timestamp96
/home/julia/conda/envs/dask-3.9/lib/python3.9/site-packages/_pytest/unraisableexception.py:78: PytestUnraisableExceptionWarning: Exception ignored in: 'pandas._libs.tslib._parse_today_now'
Traceback (most recent call last):
File "/home/julia/conda/envs/dask-3.9/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py", line 2199, in objects_to_datetime64ns
result, tz_parsed = tslib.array_to_datetime(
FutureWarning: The parsing of 'now' in pd.to_datetime without `utc=True` is deprecated. In a future version, this will match Timestamp('now') and Timestamp.now()
warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================== slowest 10 durations ===============================================
0.01s call dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12]
0.00s call dask/dataframe/io/tests/test_parquet.py::test_timestamp96
0.00s call dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13]
0.00s setup dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12]
0.00s setup dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13]
0.00s setup dask/dataframe/io/tests/test_parquet.py::test_timestamp96
0.00s teardown dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13]
0.00s teardown dask/dataframe/io/tests/test_parquet.py::test_timestamp96
0.00s teardown dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12]
============================================= short test summary info =============================================
FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12] - ...
FAILED dask/dataframe/io/tests/test_parquet.py::test_roundtrip[fastparquet-df13-write_kwargs13-read_kwargs13] - ...
FAILED dask/dataframe/io/tests/test_parquet.py::test_timestamp96 - ValueError: Error converting column "a" to by...
================================== 3 failed, 873 deselected, 3 warnings in 1.31s ================================== |
PRs left to merge:
|
I don't know about these yet, but fastparquet tests are failing too because of Int64Index -> Index(Int64) ( supposed to be marked as depr, but seems the default behaviour changed too https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#deprecated-int64index-uint64index-float64index ). |
There shouldn't be a change in behaviour (yet), the code snippet in what you linked is "future behaviour", not what is released (that's a bit confusing, as we typically show "new behaviour" in those whatsnew code snippets, i.e. for the new release) |
@jorisvandenbossche specifically, it was the index type of a read_csv call that changed (for the fastparquet breakage), and pandas refuses to compare two series with different index types, although the values are the same. |
^ the good news is that now, only the same three dask test suite failures are seen in fastparquet's CI. |
Ah, that's because you are using extension types, that indeed changed: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays (but so for default Int64Index, nothing yet changed) |
There are already some issues / PRs related to this (eg some upstream failures are also listed in ), but I thought it might be useful to have a dedicated issue about this.
Pandas released a 1.4.0rc0 last week, and probably will release a final 1.4.0 next week.
The upstream test build (eg https://github.com/dask/dask/runs/4823351165?check_suite_focus=true) shows several failures, falling into a few categories:
DataFrame.append
is deprecated, and so the warnings are turned into errors. Deprecate append when pandas >= 1.4.0 #8617test_roundtrip[fastparquet-df12-write_kwargs12-read_kwargs12]
is failing)test_parquet.py::test_getitem_optimization
failures (TypeError: Column name must be a string. Got column None of type NoneType) -> this might also be related to the to_frame failure below Fixto_frame
name
to not pass None #8554dask/dataframe/tests/test_indexing.py::test_to_frame
(the result has a "NaN" column name instead of expected default of "0") Fixto_frame
name
to not pass None #8554test_cumulative
) Fixto_frame
name
to not pass None #8554test_groupby_column_and_index_apply
) Fixaxis=None
warning #8555__array_wrap__
.) caused by pandas removing__array_wrap__
-> this might already be covered by Returnda.Array
rather thandd.Series
for non-ufunc elementwise functions ondd.Series
#8558 ?The text was updated successfully, but these errors were encountered: