Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5855: [Python] Support for Duration (timedelta) type #5566

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 2, 2019

https://issues.apache.org/jira/browse/ARROW-5855

  • Basic wrappers for DurationArray/Type/Value
  • Numpy / pandas conversion
  • Python conversion

@github-actions
Copy link

github-actions bot commented Oct 2, 2019

@@ -449,6 +449,63 @@ class TimestampConverter
TimeUnit::type unit_;
};

template <NullCoding null_coding>
class DurationConverter
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost literally a copy paste of the TimestampConverter above with timestamp-things replaced with duration-things. Is there a way to deduplicate some code?
Templating this might be difficult as I also need to switch Python / numpy C API calls? (PyDateTime_DateTime -> PyDateTime_Delta, PyDatetimeScalarObject -> PyTimedeltaScalarObject, ...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use something like this:

template <typename ArrowType>
struct PyDateTimeTraits {};

template <>
struct PyDateTimeTraits<TimestampType> {
  using PyTypeCheck = PyDateTime_Check;
  using PyDateTimeObject = PyDateTime_DateTime;

  using py_to_s = internal::PyDateTime_to_s;
  using py_to_ms = internal::PyDateTime_to_ms;
  using py_to_us = internal::PyDateTime_to_us;
  using py_to_ns = internal::PyDateTime_to_ns;

  using np_traits = internal::npy_traits<NPY_DATETIME>;
  using NpScalarObject = PyDatetimeScalarObject;
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that using can only be used with type names, and not function names?
(I could define an inline function that wraps the correct function for each, but that gets a bit verbose?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline functions sound fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I pushed a version with the template (I am personally not fully sure the end result is easier / better than some duplicated code, but ok ..)

@jorisvandenbossche
Copy link
Member Author

Also opened a few follow-up items that I ran into: https://issues.apache.org/jira/browse/ARROW-6778 (casting), overflow in conversion to ns (https://issues.apache.org/jira/browse/ARROW-6779) and parquet support (https://issues.apache.org/jira/browse/ARROW-6780).

Another question is if we want to call this type "timedelta" instead of "duration" in the python interface (like we also have pa.float64() and not pa.double()), since standard library, numpy and pandas all call this "timedelta").
Although I actually see this float64 / double is rather inconsistent (as the type you create with pa.float64still prints as "double")

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks globally fine, but some questions?

@@ -230,6 +231,26 @@ inline void set_numpy_metadata(int type, const DataType* datatype, PyArray_Descr
// datatype->type == Type::DATE64
date_dtype->meta.base = NPY_FR_D;
}
} else if (type == NPY_TIMEDELTA) {
auto date_dtype = reinterpret_cast<PyArray_DatetimeDTypeMetaData*>(out->c_metadata);
if (datatype->id() == Type::DURATION) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That shouldn't happen (as NPY_TIMEDELTA should only be passed in case of Type::DURATION). So I can just leave out the if check? (and maybe add a debug assert?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, add a debug assert.

cpp/src/arrow/python/arrow_to_pandas.cc Show resolved Hide resolved
cpp/src/arrow/python/arrow_to_pandas.cc Show resolved Hide resolved
cpp/src/arrow/python/datetime.h Outdated Show resolved Hide resolved
@@ -449,6 +449,63 @@ class TimestampConverter
TimeUnit::type unit_;
};

template <NullCoding null_coding>
class DurationConverter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use something like this:

template <typename ArrowType>
struct PyDateTimeTraits {};

template <>
struct PyDateTimeTraits<TimestampType> {
  using PyTypeCheck = PyDateTime_Check;
  using PyDateTimeObject = PyDateTime_DateTime;

  using py_to_s = internal::PyDateTime_to_s;
  using py_to_ms = internal::PyDateTime_to_ms;
  using py_to_us = internal::PyDateTime_to_us;
  using py_to_ns = internal::PyDateTime_to_ns;

  using np_traits = internal::npy_traits<NPY_DATETIME>;
  using NpScalarObject = PyDatetimeScalarObject;
};

python/pyarrow/tests/test_array.py Show resolved Hide resolved
@@ -863,6 +867,34 @@ def test_datetime_subclassing():
assert arr_us[0].as_py() == datetime.datetime(2007, 7, 13, 1,
23, 34, 123456)

data = [
MyTimedelta(1, 1, 1000),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to test the conversion with more interesting values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What values would you find interesting? (thinking about specific corner cases?)

(it's mainly the fact that the subclass is working that is being tested here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something less trivial than days and seconds having the same value.

@pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
def test_sequence_duration_with_unit(unit):
data = [
datetime.timedelta(1, 1, 1001),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to test with more interesting values here as well...

def test_duration(self):
arr = np.array([0, 3600000000000], dtype='timedelta64[ns]')

units = ['us', 'ms', 's']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not test nanoseconds as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nanoseconds are tested below separately, as they have a different behaviour for with or without pandas

python/pyarrow/tests/test_scalars.py Outdated Show resolved Hide resolved
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Will update for your suggestions later

@@ -230,6 +231,26 @@ inline void set_numpy_metadata(int type, const DataType* datatype, PyArray_Descr
// datatype->type == Type::DATE64
date_dtype->meta.base = NPY_FR_D;
}
} else if (type == NPY_TIMEDELTA) {
auto date_dtype = reinterpret_cast<PyArray_DatetimeDTypeMetaData*>(out->c_metadata);
if (datatype->id() == Type::DURATION) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That shouldn't happen (as NPY_TIMEDELTA should only be passed in case of Type::DURATION). So I can just leave out the if check? (and maybe add a debug assert?)

cpp/src/arrow/python/arrow_to_pandas.cc Show resolved Hide resolved
cpp/src/arrow/python/datetime.h Outdated Show resolved Hide resolved
@@ -926,6 +931,16 @@ def test_cast_date32_to_int():
assert result2.equals(arr)


@pytest.mark.xfail(strict=True) # TODO implement duration cast
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already opened https://issues.apache.org/jira/browse/ARROW-6778 for that (but I also already wrote the test when I found out, so thought to just keep it)

python/pyarrow/tests/test_array.py Show resolved Hide resolved
@@ -863,6 +867,34 @@ def test_datetime_subclassing():
assert arr_us[0].as_py() == datetime.datetime(2007, 7, 13, 1,
23, 34, 123456)

data = [
MyTimedelta(1, 1, 1000),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What values would you find interesting? (thinking about specific corner cases?)

(it's mainly the fact that the subclass is working that is being tested here)

def test_duration(self):
arr = np.array([0, 3600000000000], dtype='timedelta64[ns]')

units = ['us', 'ms', 's']
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nanoseconds are tested below separately, as they have a different behaviour for with or without pandas

python/pyarrow/tests/test_scalars.py Outdated Show resolved Hide resolved
Copy link
Member Author

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated for your feedback! Added some additional questions

cpp/src/arrow/python/arrow_to_pandas.cc Show resolved Hide resolved
@@ -449,6 +449,63 @@ class TimestampConverter
TimeUnit::type unit_;
};

template <NullCoding null_coding>
class DurationConverter
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that using can only be used with type names, and not function names?
(I could define an inline function that wraps the correct function for each, but that gets a bit verbose?)

@jorisvandenbossche
Copy link
Member Author

@ursabot crossbow submit docker-python-3.6-nopandas

@ursabot
Copy link

ursabot commented Oct 4, 2019

AMD64 Conda Crossbow Submit (#67612) builder has been succeeded.

Revision: f35aa25

Submitted crossbow builds: ursa-labs/crossbow @ ursabot-238

Task Status
docker-python-3.6-nopandas CircleCI

@codecov-io
Copy link

codecov-io commented Oct 4, 2019

Codecov Report

Merging #5566 into master will increase coverage by 0.55%.
The diff coverage is 84.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5566      +/-   ##
==========================================
+ Coverage   88.79%   89.35%   +0.55%     
==========================================
  Files         983      791     -192     
  Lines      132170   117358   -14812     
  Branches     1501        0    -1501     
==========================================
- Hits       117362   104865   -12497     
+ Misses      14443    12493    -1950     
+ Partials      365        0     -365
Impacted Files Coverage Δ
python/pyarrow/tests/test_schema.py 100% <ø> (ø) ⬆️
python/pyarrow/__init__.py 71.25% <ø> (ø) ⬆️
python/pyarrow/tests/test_misc.py 94.73% <ø> (ø) ⬆️
python/pyarrow/array.pxi 79.37% <0%> (-0.16%) ⬇️
python/pyarrow/tests/test_types.py 97.33% <100%> (+0.11%) ⬆️
python/pyarrow/tests/test_convert_builtin.py 97.5% <100%> (+0.25%) ⬆️
python/pyarrow/types.py 97.89% <100%> (+0.04%) ⬆️
cpp/src/arrow/python/numpy_to_arrow.cc 95.96% <100%> (+0.01%) ⬆️
cpp/src/arrow/python/datetime.h 100% <100%> (ø) ⬆️
python/pyarrow/tests/test_array.py 93.52% <100%> (-0.07%) ⬇️
... and 231 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af097e6...cd9f12a. Read the comment docs.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, just a nit

static constexpr const char* const np_name = "timedelta64";
};

template <NullCoding null_coding, typename Type>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: call this ArrowType?

@jorisvandenbossche
Copy link
Member Author

@pitrou updated, and all green!

@pitrou pitrou closed this in c805b5f Oct 8, 2019
@pitrou
Copy link
Member

pitrou commented Oct 8, 2019

Thanks @jorisvandenbossche !

@jorisvandenbossche jorisvandenbossche deleted the ARROW-5855-duration-python branch October 8, 2019 11:17
@jorisvandenbossche
Copy link
Member Author

And thanks a lot for your review and help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants