ARROW-618: [Python/C++] Support timestamp+timezone conversion to pandas#375
wesm wants to merge 3 commits into apache:master
rebased
cpp/src/arrow/type.cc
Outdated
The Google style guide recommends pointers for mutable arguments rather than mutable reference arguments: https://google.github.io/styleguide/cppguide.html#Reference_Arguments. Our code isn't 100% consistent about this, but we should try to follow that convention. In the case of std::ostream you can see this applied in Protocol Buffers and other Google codebases: https://github.com/google/protobuf/blob/fd046f6263fb17383cafdbb25c361e3451c31105/src/google/protobuf/io/zero_copy_stream_impl.h#L265
here's some examples in TensorFlow too
python/pyarrow/__init__.py
Outdated
Importing it here as private seems useless, as I thought the imports here were mainly meant to expose the public interface?
This was pretty annoying: after `from pyarrow.schema import schema`, the statement `import pyarrow.schema` returns `pyarrow.schema.schema` instead of the module. Other ideas?
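For readers who have not hit this before, here is a minimal sketch that reproduces the shadowing with a throwaway package. `demo_pkg` is a hypothetical stand-in for `pyarrow`, built in a temporary directory; nothing here is taken from the actual patch.

```python
import importlib
import os
import sys
import tempfile
import types

# Build a throwaway package on disk; `demo_pkg` is a hypothetical stand-in.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)

with open(os.path.join(pkg_dir, "schema.py"), "w") as f:
    f.write("def schema():\n    return 'schema() called'\n")

with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    # Same shape as `from pyarrow.schema import schema` in the package __init__.
    f.write("from demo_pkg.schema import schema\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg.schema

# The package attribute now names the function, because the from-import in
# __init__ rebound `schema` after the submodule was attached to the package.
shadowed = demo_pkg.schema

# The submodule itself is still registered under its dotted name.
real_module = sys.modules["demo_pkg.schema"]

print(type(shadowed).__name__, type(real_module).__name__)
```

The submodule stays reachable through `sys.modules` (or `importlib.import_module`), so only attribute access on the package is affected.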
python/pyarrow/array.pyx
Outdated
We should probably reduce the usage of the reserved word `type` in our codebase.
Since there's `__builtin__.type` I haven't worried too much about it, but I'm OK with another naming convention, like `type_`. What do you prefer?
Either `type_` or `dtype` would be OK for me, but as it's used internally, we can keep it. I wasn't 100% sure about the implications of using a reserved keyword.
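On the implications: in Python, `type` is a builtin name rather than a hard keyword, so using it as an argument name is legal and only hides the builtin inside that scope. A small sketch follows; the function names are made up for illustration and are not from the codebase.

```python
import builtins
import keyword

def describe(value, type=None):
    # Inside this function `type` is the argument, so the builtin is hidden;
    # `builtins.type` still reaches it explicitly.
    actual = builtins.type(value).__name__
    return f"{actual} (requested: {type})"

def describe_renamed(value, type_=None):
    # With a trailing underscore the builtin stays directly usable.
    actual = type(value).__name__
    return f"{actual} (requested: {type_})"

# `type` is not in the hard keyword list, which is why the first form compiles.
is_hard_keyword = keyword.iskeyword("type")

first = describe(1.5, type="float64")
second = describe_renamed(1.5, type_="float64")
print(is_hard_keyword, first, second)
```

So the cost of keeping `type` is readability and the occasional need for `builtins.type`, not a correctness problem.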
python/pyarrow/schema.pyx
Outdated
IMO this is not thread safe, do we care?
no, as we never delete anything from the cache.
What is not thread safe here? The GIL is held here
Sorry, missed the part where you hold the GIL
Oh I see, we could have two threads adding unit to the cache. Yeah I think it's fine. I was thinking about C-level threadsafety
… using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py
Change-Id: Ia05a8eba9833874b48726ed6a221577937920eed
Change-Id: Ib8841ed34d6153f354f3364f20464e42ae885d7d
Thanks -- rebased, will merge on green build
Looks like the OS X builds are consistently failing (https://travis-ci.org/wesm/arrow/jobs/210644296). This seems to be something wrong with the miniconda environment bootstrap. Merging and we can investigate further if it is not transient.
This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in `pyarrow/adapters/pandas.cc`. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least.
The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be `timestamp('ns', tz='US/Eastern')`), then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions to integer from float because the type metadata is an input parameter.
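As an illustration only, the shape of that refactoring can be sketched in plain Python: the target type is decided first and passed in, and a visitor dispatches on it to build the output. All class and function names below are hypothetical; the real implementation is C++ in `pyarrow/adapters/pandas.cc` and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Int64Type:
    def accept(self, visitor, values):
        return visitor.visit_int64(self, values)

@dataclass(frozen=True)
class TimestampType:
    unit: str
    tz: Optional[str] = None  # time zone travels as type metadata

    def accept(self, visitor, values):
        return visitor.visit_timestamp(self, values)

class ArrayBuilder:
    """Hypothetical visitor: one method per target type."""

    def visit_int64(self, type_, values):
        # The requested type drives the cast, so integral floats become ints.
        if any(float(v) != int(v) for v in values):
            raise ValueError("non-integral value cannot be cast to int64")
        return {"type": type_, "data": [int(v) for v in values]}

    def visit_timestamp(self, type_, values):
        # Values are stored as integer offsets; unit and tz stay on the type.
        return {"type": type_, "data": [int(v) for v in values]}

def convert(values, type_):
    # The data type is an input parameter, computed before conversion starts.
    return type_.accept(ArrayBuilder(), values)

ints = convert([1.0, 2.0, 3.0], Int64Type())
stamps = convert([0, 1_000_000_000], TimestampType("ns", tz="US/Eastern"))
print(ints["data"], stamps["type"])
```

Because the caller supplies the type, supporting a new conversion means adding a visit method rather than another template specialization.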
Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive.
I also changed the C++ `pyarrow` namespace to `arrow::py`, which will make it less painful to move that code tree to `cpp/src/arrow/python` at some point.
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes apache#375 from wesm/ARROW-618 and squashes the following commits:
4b18bfa [Wes McKinney] Fix rebase conflict
5bc3724 [Wes McKinney] Fix rebase issues
870986f [Wes McKinney] Refactor ArrowSerializer to not be a template and use visitor pattern using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py
… 90-character line width
The main change is horizontal alignment. We should also do a clang-tidy pass sometime to do some further scrubbing
Author: Wes McKinney <wes.mckinney@twosigma.com>
Closes apache#375 from wesm/PARQUET-1068 and squashes the following commits:
b81145d [Wes McKinney] Modify .clang-format to use straight Google format with 90-character line width
Change-Id: Ib789b38872430bb3903f233a795b84357df6385a