
ARROW-203: Python: Basic filename based Parquet read/write #83

Closed · wants to merge 28 commits

Conversation

@xhochy (Member) commented May 29, 2016

No description provided.

@xhochy (Member, Author) commented May 29, 2016

Travis fails because we do not build Parquet. This now raises the question of how we should pull parquet-cpp into the CI setup: using a git commit hash, or just pulling HEAD? And should it come in via the thirdparty scripts or via a different approach?

@xhochy (Member, Author) commented May 29, 2016

Some thoughts on my questions from above:

  • I would pin a commit hash / release for parquet-cpp in the thirdparty build.
  • For now, we could use the existing thirdparty build infrastructure. There may be a need in the future to have thirdparty builds inside CMake so that compiler options are passed through (e.g. for a full debug build or when using LTO), but I would do that separately, en bloc.


# Must be in one expression to avoid calling std::move which is not possible
# in Cython (due to missing rvalue support)
reader = unique_ptr[FileReader](new FileReader(default_memory_pool(),
@emkornfield (Contributor) commented May 29, 2016


I think you can use reader.reset, so you aren't relying on the move functionality / rvalue references at all?

@xhochy (Member, Author):

No, this is not about the assignment to reader but about forwarding the return value of ParquetFileReader.OpenFile(filename).

Member:

We are probably going to run into some issues with Cython's lackluster support for C++11 features. What I've done in other cases is create a shim layer, either header-only or a header + companion .cpp file, that provides a simplified API for the Cython caller. I think for now it's good to get this end-to-end use case working.
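A shim of the kind described might look like the following minimal sketch (all names here are hypothetical; parquet-cpp's actual reader API differs). The idea is to keep the std::move / rvalue handling on the C++ side and hand Cython a plain pointer:

```cpp
// shim.h -- hypothetical header-only shim for a Cython caller.
// Cython (as of 2016) cannot express std::move, so the move stays in C++.
#include <memory>
#include <string>

// Stand-in for the parquet-cpp reader type discussed above.
struct FileReader {
  explicit FileReader(const std::string& p) : path(p) {}
  std::string path;
};

// Factory mimicking a C++ API that returns by unique_ptr.
inline std::unique_ptr<FileReader> OpenFile(const std::string& path) {
  return std::unique_ptr<FileReader>(new FileReader(path));
}

// The shim releases ownership to a raw pointer, which Cython can assign
// and later wrap (e.g. with reader.reset(...)) without any rvalue syntax.
inline FileReader* OpenFileForCython(const std::string& path) {
  return OpenFile(path).release();
}
```

On the Cython side the caller would then only need `reader.reset(OpenFileForCython(path))`, sidestepping the single-expression workaround in the diff above.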

@emkornfield (Contributor):

Seems reasonable to me. I'm not super familiar with this part of the code yet, so getting some feedback from @wesm also likely makes sense. (Also, it looks like the build is failing with the latest push.)

std::shared_ptr<Table> out;
ReadTableFromFile(std::move(file_reader), &out);
ASSERT_EQ(1, out->num_columns());
ASSERT_EQ(100, out->num_rows());
Member:

Maybe good to get into the habit of writing const int num_rows = 100.
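The suggestion, sketched against a stand-in Table type (the real test uses Arrow's Table and gtest's ASSERT_EQ):

```cpp
#include <cassert>

// Minimal stand-in for the Table produced by the round-trip test above.
struct Table {
  int num_columns() const { return 1; }
  int num_rows() const { return 100; }
};

// Naming the magic numbers keeps the writer and reader sides of the
// round-trip test in sync and makes assertion failures self-explanatory.
const int kNumColumns = 1;
const int kNumRows = 100;

inline void CheckRoundTrip(const Table& out) {
  assert(kNumColumns == out.num_columns());
  assert(kNumRows == out.num_rows());
}
```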

@wesm (Member) commented Jun 1, 2016

It's really great that you got this working!

Re: the thirdparty toolchain, at least for testing the Python side, we may be able to use the conda dev artifacts to simplify things (https://anaconda.org/apache/parquet-cpp). Having to build parquet-cpp from source (and its thirdparty dependencies, especially Thrift) kind of stinks. Let me know what you think.

if (!primitive_array) {
  PARQUET_IGNORE_NOT_OK(writer.Close());
  return Status::NotImplemented("Table must consist of PrimitiveArray instances");
}
Member (@wesm):

@emkornfield per our related discussion about strings: at least in Parquet-land, variable- and fixed-length binary values (and strings) are considered primitive types. So you could have a table of all string columns and it would still be semantically "flat". I don't think it's a big deal, but it will impact code like this (obviously we will want to address nested data as soon as we can here).

Contributor (@emkornfield):

@wesm That's a fair point, but I'm not sure the semantic flatness makes a big difference here. The complicated part of converting Arrow binary/strings to Parquet still remains, and the best way to convert the data depends on the encoding. If it is plain encoding, then we would need to convert the ParquetByteArray; if it's delta-encoded lengths, then a new batch API for the Parquet writer is likely more appropriate.

I'm not attached to either representation, so if we can get the type system to better support some use cases, I'm for it. But it feels like, for Parquet, the advantages of making string types primitive might not necessarily solve the harder problems. This debate will all go away when we JIT everything, right? ;-)

Member (@wesm):

It's more a question of the type hierarchy. Conceivably we could have code involving std::is_base_of with PrimitiveArray. This does not need to be resolved here.
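A compile-time check of the kind mentioned could look like this (placeholder class names; Arrow's actual hierarchy differs in detail):

```cpp
#include <type_traits>

// Placeholder hierarchy mirroring the discussion: under Parquet's view,
// strings would derive from PrimitiveArray; under the alternative sketched
// here, they sit on a separate branch of the hierarchy.
struct Array {};
struct PrimitiveArray : Array {};
struct Int32Array : PrimitiveArray {};
struct StringArray : Array {};  // not primitive in this hierarchy

// Writer code could then gate on the hierarchy at compile time.
template <typename ArrayType>
constexpr bool IsPrimitive() {
  return std::is_base_of<PrimitiveArray, ArrayType>::value;
}

static_assert(IsPrimitive<Int32Array>(), "int32 arrays are primitive");
static_assert(!IsPrimitive<StringArray>(), "strings are not, in this sketch");
```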

@wesm (Member) commented Jun 9, 2016

I left minor comments, but this is a great start to getting this working. Can you add a JIRA for adding support and testing for strings, booleans, and smaller integer types (int8, int16)? We will also need to deal with unsigned integers (but they will come back as signed integers until Parquet 2.0 happens someday... c'est la vie).

@xhochy (Member, Author) commented Jun 10, 2016

Addressed all comments. For the conda things I probably need some expert advice (toolchain and LD_LIBRARY_PATH); otherwise, everything is included.

@wesm (Member) commented Jun 10, 2016

Thank you. In the interest of getting this in, I'm merging it now; we should address the build/packaging comment separately (particularly if conda install pyarrow -c apache/channel/dev doesn't work out of the box after this patch).

@wesm (Member) commented Jun 10, 2016

+1

@asfgit asfgit closed this in ec66ddd Jun 10, 2016
@xhochy xhochy deleted the arrow-203 branch March 7, 2017 16:16
praveenbingo added a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018
Introducing a cache to hold the projectors and filters for re-use.
The cache is a LRU that can hold 100 entries.
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 2, 2018
PARQUET-559: Enable an external InputStream as a source to the ParquetFileReader

Author: Deepak Majeti <deepak.majeti@hpe.com>

Closes apache#83 from majetideepak/PARQUET-559 and squashes the following commits:

d7f9c47 [Deepak Majeti] modified ParquetFileReader API
b5269f4 [Deepak Majeti] modified travis ci script to print log on failure
940de12 [Deepak Majeti] fixed memory leak
6de0b70 [Deepak Majeti] print logs
e5ea341 [Deepak Majeti] print failure
9462954 [Deepak Majeti] try fixing test failure
3a0a1d7 [Deepak Majeti] modified External Stream
9d8b44c [Deepak Majeti] fixed formatting
11b3e6f [Deepak Majeti] Enable an external InputStream as a source to the ParquetFileReader
xuechendi pushed a commit to xuechendi/arrow that referenced this pull request Aug 19, 2020
[oap-native-sql] Cast to String and cast to smaller scale type support