[Python] Cannot read parquet files with row group size of 1 From HDFS #19217

Closed
asfimport opened this issue Jul 12, 2018 · 1 comment

@asfimport

This might be a bug in parquet-cpp; I need to spend more time tracking it down. Given a file with a single row on HDFS, reading it with pyarrow yields this error:


TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)

The following code causes it:


import pyarrow
import pyarrow.parquet as pq

fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')  # fill in namenode information
file_object = fs.open('single-row.parquet')  # update for hdfs path of file

pq.read_metadata(file_object)  # this works

parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error

I am working on writing a unit test for this. Note that I am using libhdfs3.
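
Since the trace fails inside parquet::SerializedFile::ParseMetaData, which reads the file footer, one way to separate a connection problem from a reader problem is to read the footer tail directly from the file object. A Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. The following is only a rough sketch using the standard library: read_footer_length is a hypothetical helper, not part of pyarrow, and it assumes the file object is seekable with a whence argument, which I have not verified for the libhdfs3-backed file object.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
FOOTER_TAIL_SIZE = 8  # 4-byte metadata length + 4-byte magic


def read_footer_length(file_object):
    """Return the metadata length stored in the last 8 bytes of a Parquet file.

    Hypothetical diagnostic helper; expects a seekable binary file-like object.
    """
    file_object.seek(0, io.SEEK_END)
    file_size = file_object.tell()
    # Smallest plausible file: leading magic plus the 8-byte footer tail.
    if file_size < len(PARQUET_MAGIC) + FOOTER_TAIL_SIZE:
        raise ValueError("file too small to be Parquet: %d bytes" % file_size)

    file_object.seek(file_size - FOOTER_TAIL_SIZE)
    tail = file_object.read(FOOTER_TAIL_SIZE)
    if len(tail) != FOOTER_TAIL_SIZE:
        # A short read here points at the transport rather than the Parquet reader.
        raise IOError("short read: got %d of %d bytes" % (len(tail), FOOTER_TAIL_SIZE))

    (metadata_length,) = struct.unpack("<I", tail[:4])
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    if metadata_length > file_size - FOOTER_TAIL_SIZE - len(PARQUET_MAGIC):
        raise ValueError("metadata length exceeds file size")
    return metadata_length
```

If this succeeds on the same file_object right before pq.ParquetFile(file_object) fails, the footer bytes are reachable and the problem is more likely in the reader; if it raises a short-read or socket error, the HDFS connection is the more likely culprit.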

Reporter: Robbie Gruener / @rgruener

Original Issue Attachments:

Note: This issue was originally created as ARROW-2842. Please see the migration documentation for further details.

@asfimport

Robbie Gruener / @rgruener:
I have not been able to reproduce this reliably. It was likely due to an HDFS connection issue rather than a problem in pyarrow.
