ARROW-618: [Python/C++] Support timestamp+timezone conversion to pandas by wesm · Pull Request #375 · apache/arrow

wesm · 2017-03-13T00:00:35Z

This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in pyarrow/adapters/pandas.cc. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.

The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be timestamp('ns', tz='US/Eastern')), then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions from float to integer because the type metadata is an input parameter.
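To make the "type computed up front, then visitor dispatch" idea concrete, here is a minimal pure-Python sketch. It is not the C++ code in this patch: `Int64Type`, `TimestampType`, `ToyConverter` and `convert` are hypothetical names invented for illustration, and the "arrays" are plain lists.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for Arrow data types (not the pyarrow API).
@dataclass(frozen=True)
class Int64Type:
    def accept(self, visitor, values):
        return visitor.visit_int64(self, values)


@dataclass(frozen=True)
class TimestampType:
    unit: str = "ns"
    tz: Optional[str] = None

    def accept(self, visitor, values):
        return visitor.visit_timestamp(self, values)


class ToyConverter:
    """One visit method per target type; returns (type_label, values)."""

    def visit_int64(self, type_, values):
        # The target type is an input, so float inputs can be cast here.
        return ("int64", [int(v) for v in values])

    def visit_timestamp(self, type_, values):
        label = "timestamp[%s]" % type_.unit
        if type_.tz is not None:
            # Time zone metadata travels with the type, not with the values.
            label += "[tz=%s]" % type_.tz
        return (label, [int(v) for v in values])


def convert(values, type_):
    # The caller decides the target type first; dispatch happens on the type.
    return type_.accept(ToyConverter(), values)
```

For example, `convert([1.0, 2.0], Int64Type())` casts the floats to integers, while `convert([0], TimestampType("ns", "US/Eastern"))` keeps the time zone in the type label only.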

Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.

I also changed the C++ pyarrow namespace to arrow::py, which will make it less painful to move that code tree to cpp/src/arrow/python at some point.

wesm · 2017-03-13T03:32:24Z

rebased

xhochy · 2017-03-13T07:36:38Z

cpp/src/arrow/type.cc

Reference instead of pointer?

Google style guide recommends pointers for mutable arguments / not using mutable reference arguments https://google.github.io/styleguide/cppguide.html#Reference_Arguments -- our code isn't 100% consistent about this, but we should try to follow that convention. In the case of std::ostream you can see this applied in Protocol Buffers and other Google codebases: https://github.com/google/protobuf/blob/fd046f6263fb17383cafdbb25c361e3451c31105/src/google/protobuf/io/zero_copy_stream_impl.h#L265

Here are some examples in TensorFlow too:

https://github.com/tensorflow/tensorflow/blob/be52c5c09e39ac2df007fb2d62abe122d5ade6d0/tensorflow/core/platform/default/logging.cc#L155

Ok, that makes sense. Thanks!

xhochy · 2017-03-13T07:38:13Z

python/pyarrow/__init__.py

Importing it here as private seems useless, as I thought the imports here were mainly there to expose the public interface?

This was pretty annoying: after from pyarrow.schema import schema, the statement import pyarrow.schema returns pyarrow.schema.schema instead of the module. Other ideas?

No, then we keep it.
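The shadowing described in this thread can be reproduced with a throwaway package. The sketch below assumes the private import looks like `import ... as _schema`; the package name `toypkg` and its contents are hypothetical and only mirror the shape of the problem.

```python
import os
import sys
import tempfile
import types

# Build a tiny package on disk: toypkg/schema.py defines a function that
# has the same name as its module, as in the thread above.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "toypkg")
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, "schema.py"), "w") as f:
    f.write("def schema():\n    return 'a schema'\n")
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write(
        # Private alias keeps a handle on the submodule itself.
        "import toypkg.schema as _schema\n"
        # This rebinds the package attribute 'schema' to the function.
        "from toypkg.schema import schema\n"
    )

sys.path.insert(0, root)
import toypkg
import toypkg.schema  # does not restore the attribute; module is cached

shadowed = toypkg.schema                    # the function, not the module
private_module = toypkg._schema             # the module, via the alias
real_module = sys.modules["toypkg.schema"]  # the module, via the import cache
```

After this runs, `toypkg.schema` is the function, while the submodule is still reachable through the private alias or `sys.modules`.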

xhochy · 2017-03-13T07:39:33Z

python/pyarrow/array.pyx

We should probably reduce the usage of the reserved word type in our codebase.

Since there's __builtin__.type I haven't worried too much about it, but I'm OK with another naming convention, like type_, what do you prefer?

Either type_ or dtype would be ok for me but as it's internally used, we can keep it. I wasn't 100% sure about the implications of using a reserved keyword.
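For reference, a small Python 3 sketch of what using `type` as a name implies (the thread mentions `__builtin__`, which is the Python 2 spelling of `builtins`). The helper names `describe` and `describe_safe` are made up for illustration.

```python
import builtins
import keyword

# "type" is a builtin name, not a reserved keyword, so it can be rebound.
is_keyword = keyword.iskeyword("type")
is_builtin = hasattr(builtins, "type")


def describe(value, type=None):
    # Inside this function the parameter shadows the builtin, so the
    # builtin has to be reached through the builtins module.
    if type is None:
        type = builtins.type(value)
    return type.__name__


def describe_safe(value, type_=None):
    # With a trailing underscore the builtin stays directly usable.
    if type_ is None:
        type_ = type(value)
    return type_.__name__
```

Both variants work; the trailing-underscore form just avoids having to go through `builtins` when the builtin is needed in the same scope.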

tebeka · 2017-03-13T10:52:20Z

python/pyarrow/schema.pyx

IMO this is not thread safe, do we care?

no, as we never delete anything from the cache.

What is not thread safe here? The GIL is held here

Sorry, missed the part where you hold the GIL

Oh I see, we could have two threads adding unit to the cache. Yeah I think it's fine. I was thinking about C-level threadsafety

wesm · 2017-03-13T17:23:03Z

Thanks -- rebased, will merge on green build

wesm · 2017-03-13T20:15:20Z

Looks like the OS X builds are consistently failing (https://travis-ci.org/wesm/arrow/jobs/210644296). This seems to be something wrong with the miniconda environment bootstrap. Merging; we can investigate further if it is not transient.

This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in `pyarrow/adapters/pandas.cc`. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.

The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be `timestamp('ns', tz='US/Eastern')`), then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions from float to integer because the type metadata is an input parameter.

Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.

I also changed the C++ `pyarrow` namespace to `arrow::py`, which will make it less painful to move that code tree to `cpp/src/arrow/python` at some point.

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#375 from wesm/ARROW-618 and squashes the following commits:

4b18bfa [Wes McKinney] Fix rebase conflict
5bc3724 [Wes McKinney] Fix rebase issues
870986f [Wes McKinney] Refactor ArrowSerializer to not be a template and use visitor pattern using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py

… 90-character line width The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#375 from wesm/PARQUET-1068 and squashes the following commits: b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width Change-Id: Ib789b38872430bb3903f233a795b84357df6385a


wesm force-pushed the ARROW-618 branch from ad4ad81 to b2b6a93 Compare March 13, 2017 03:31

xhochy reviewed Mar 13, 2017

tebeka reviewed Mar 13, 2017

jreback mentioned this pull request Mar 13, 2017

API: expose pandas.errors pandas-dev/pandas#15541

Closed

wesm added 3 commits March 13, 2017 13:18

Refactor ArrowSerializer to not be a template and use visitor pattern…

870986f

… using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py

Fix rebase issues

5bc3724

Change-Id: Ia05a8eba9833874b48726ed6a221577937920eed

Fix rebase conflict

4b18bfa

Change-Id: Ib8841ed34d6153f354f3364f20464e42ae885d7d

wesm force-pushed the ARROW-618 branch from b2b6a93 to 4b18bfa Compare March 13, 2017 17:22

asfgit closed this in 00df40c Mar 13, 2017

wesm deleted the ARROW-618 branch March 13, 2017 20:20
