[Python] Cannot read parquet files with row group size of 1 From HDFS #19217

asfimport · 2018-07-12T21:53:29Z

This might be a bug in parquet-cpp; I need to spend more time tracking it down. Given a Parquet file with a single row on HDFS, reading it with pyarrow yields this error:


TcpSocket.cpp: 79: HdfsEndOfStream: Read 8 bytes failed from: End of the stream
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ Unknown
 @ arrow::io::HdfsReadableFile::ReadAt(long, long, long*, void*)
 @ parquet::ArrowInputFile::ReadAt(long, long, unsigned char*)
 @ parquet::SerializedFile::ParseMetaData()
 @ parquet::ParquetFileReader::Contents::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::ParquetFileReader::Open(std::unique_ptr<parquet::RandomAccessSource, std::default_delete<parquet::RandomAccessSource> >, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&)
 @ parquet::arrow::OpenFile(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData> const&, std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >*)
 @ __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*)
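
The failed 8-byte read under ParseMetaData is the same size as the tail of a Parquet file, which ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. Whether that is the exact read that failed here is an assumption, not something the trace confirms. A minimal sketch for checking that tail on any seekable binary file object, using only the standard library (the helper name read_footer_tail is made up for illustration and is not a pyarrow API):

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
FOOTER_TAIL_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic


def read_footer_tail(file_obj):
    """Return the metadata length stored in the last 8 bytes of a Parquet file.

    Hypothetical helper for debugging; file_obj is any seekable binary file.
    """
    file_obj.seek(0, io.SEEK_END)
    file_size = file_obj.tell()
    if file_size < FOOTER_TAIL_SIZE:
        raise ValueError("file is smaller than the Parquet footer tail")

    file_obj.seek(file_size - FOOTER_TAIL_SIZE)
    tail = file_obj.read(FOOTER_TAIL_SIZE)
    if len(tail) != FOOTER_TAIL_SIZE:
        # A short read here would point at the transport, not the file contents.
        raise IOError(f"short read: expected 8 bytes, got {len(tail)}")

    metadata_len, magic = struct.unpack("<I4s", tail)
    if magic != PARQUET_MAGIC:
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return metadata_len
```

If this succeeds on the object returned by fs.open(...) but opening it as a ParquetFile still fails, the file contents are probably fine and the connection is the more likely culprit.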

The following code causes it:


import pyarrow
import pyarrow.parquet as pq

# Connect through the libhdfs3 driver; fill in the namenode information.
fs = pyarrow.hdfs.connect('my-namenode-url', driver='libhdfs3')

# Update with the HDFS path of the file.
file_object = fs.open('single-row.parquet')

pq.read_metadata(file_object)  # this works

parquet_file = pq.ParquetFile(file_object)
parquet_file.read_row_group(0)  # throws error

I am working on writing a unit test for this. Note that I am using libhdfs3.
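
While narrowing this down, it may help to see which reads are issued against the file object and whether any of them come back short. A rough sketch, assuming the reader accepts a Python file-like object and goes through its read method (ReadTracer is a hypothetical helper, not part of pyarrow, and reads made through other methods such as readinto would not be recorded):

```python
import io


class ReadTracer:
    """Wrap a binary file object and record (offset, requested, returned) per read."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.reads = []

    def read(self, nbytes=None):
        offset = self._wrapped.tell()
        if nbytes is None:
            data = self._wrapped.read()
        else:
            data = self._wrapped.read(nbytes)
        # A returned length smaller than the requested one marks a short read.
        self.reads.append((offset, nbytes, len(data)))
        return data

    def __getattr__(self, name):
        # Delegate seek, tell, close, and everything else to the wrapped object.
        return getattr(self._wrapped, name)
```

Wrapping file_object as ReadTracer(file_object) before handing it to pq.ParquetFile and then inspecting the reads list after a failure should show the offset and size of the last read attempted.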

Reporter: Robbie Gruener / @rgruener

Original Issue Attachments:

single-row.parquet

Note: This issue was originally created as ARROW-2842. Please see the migration documentation for further details.


asfimport · 2018-07-25T19:24:08Z

Robbie Gruener / @rgruener:
I have not been able to reproduce this reliably. It was likely caused by an HDFS connection issue rather than a bug in pyarrow.

asfimport closed this as completed Jul 25, 2018
