ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources #8300

Closed
wants to merge 13 commits into from

rdettai
Contributor

@rdettai rdettai commented Sep 29, 2020

@rdettai
Contributor Author

rdettai commented Sep 29, 2020

I first moved the footer parsing out of the SerializedFileReader. Next, I will try to move as much logic as possible from the SerializedFileReader and SerializedRowGroupReader implementations into the FileReader and RowGroupReader traits.

@alamb
Contributor

alamb commented Sep 30, 2020

From https://issues.apache.org/jira/browse/ARROW-10135, it seems like your goal is to support reading Parquet files from sources other than files.

I wonder if you have tried implementing the ParquetReader trait for another data source?

Then you can create a file reader from that other source like:

impl ParquetReader for ThingThatImplementsParquetReader {
...
}
...

let source = ThingThatImplementsParquetReader::new();
let file_reader = SerializedFileReader::new(source);
...

@rdettai
Contributor Author

rdettai commented Oct 7, 2020

The need for an intermediate layer when reading a Parquet file was discussed with @alamb on JIRA.

The highlights of the current implementation:

  • The public API has changed, but it keeps working for File and Path thanks to the corresponding trait implementations. Cursor can no longer be used because it requires copying the data whenever it is passed around with clone() (this was already the case in the implementation of TryClone for Cursor<Vec<u8>>).
  • I have added a custom cursor type (SliceableCursor) that can create cursor slices without cloning the underlying data. This can be used to read in-memory files. I guess it could be made more generic, but that would be for convenience only, and I find it simple and clear as is.
  • I have separated the implementations (SerializedFileReader, SerializedRowGroupReader...) from the traits (FileReader, RowGroupReader...) for more clarity. I know this is not how the rest of the code base is structured, but I tend to get lost in these huge files with millions of struct/trait/impl blocks. I'm very much open to suggestions on this point!
  • There is nothing about async/parallelism yet; I have to think about it a bit more.
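
To illustrate the "cursor slices without cloning the underlying data" idea from the list above, here is a minimal sketch. It is not the PR's SliceableCursor: the name `SharedCursor`, its fields, and its methods are hypothetical, and it assumes the bytes are shared through an `Rc`.

```rust
use std::io::{self, Read};
use std::rc::Rc;

/// Hypothetical stand-in for a sliceable cursor: a window over shared bytes.
#[derive(Clone, Debug)]
pub struct SharedCursor {
    data: Rc<Vec<u8>>, // shared buffer, never copied when slicing
    start: usize,      // first byte of this window
    end: usize,        // one past the last byte of this window
    pos: usize,        // read position, relative to `start`
}

impl SharedCursor {
    pub fn new(data: Vec<u8>) -> Self {
        let end = data.len();
        SharedCursor { data: Rc::new(data), start: 0, end, pos: 0 }
    }

    /// Number of bytes covered by this window.
    pub fn len(&self) -> usize {
        self.end - self.start
    }

    /// Returns a cursor over `length` bytes starting at `offset` inside this window.
    /// Only the reference count is incremented; the bytes are not cloned.
    pub fn slice(&self, offset: usize, length: usize) -> io::Result<SharedCursor> {
        let start = self.start.checked_add(offset);
        let end = start.and_then(|s| s.checked_add(length));
        match (start, end) {
            (Some(s), Some(e)) if e <= self.end => Ok(SharedCursor {
                data: Rc::clone(&self.data),
                start: s,
                end: e,
                pos: 0,
            }),
            _ => Err(io::Error::new(io::ErrorKind::InvalidInput, "slice out of bounds")),
        }
    }
}

impl Read for SharedCursor {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Copy as many bytes as fit from the unread part of the window.
        let remaining = &self.data[self.start + self.pos..self.end];
        let n = remaining.len().min(buf.len());
        buf[..n].copy_from_slice(&remaining[..n]);
        self.pos += n;
        Ok(n)
    }
}
```

Each slice is an independent reader with its own position, so several slices of the same buffer can be consumed without affecting one another.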

@alamb: can you take a look at the new ChunckReader trait and how it is integrated into the rest of the reader? What do you think about it?

Contributor

@alamb alamb left a comment

@rdettai -- I looked at ChunckReader and skimmed how it was used in the rest of the code.

TLDR is I think it is a nice improvement and in my opinion this PR moves the Rust parquet writer forward.

Some comments:

  1. I think using a more standard spelling of Chunk rather than Chunck might improve the readability somewhat
  2. I wonder how important it is to separate out the types and the get_read and get_read_seek methods? Given that the underlying implementation of ChunckReader will have to implement seekable streams anyway, I wonder how much performance is gained for the additional API complexity (indeed, both File and SliceableCursor appear to use the same type for T and U)
  3. I like SliceableCursor -- it is a nice way to avoid the copies that occur with the actual Cursor implementation
  4. I think we should begin moving the parquet reader towards using Arc rather than Rc so as to better prepare ourselves for multi-threading


pub trait ReadChunck: Read + Length {}
pub trait ReadSeekChunck: ReadChunck + Seek {}
impl<T: Read + Length> ReadChunck for T {}
Contributor

I like this pattern of automatically implementing the combined trait for anything that implements Read + Length (I struggled doing that with FileReader when working with the Parquet API earlier in my Rust days).
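
For readers unfamiliar with the pattern, here is a small self-contained sketch of such a blanket implementation. The `Length` trait below is a local stand-in with an assumed `len` method, `ReadChunk` uses the corrected spelling, and `drain` is a hypothetical helper added only to show the combined trait in use.

```rust
use std::io::{Cursor, Read};

/// Local stand-in: the total number of bytes in the source.
pub trait Length {
    fn len(&self) -> u64;
}

/// Combined trait: anything readable whose length is known.
pub trait ReadChunk: Read + Length {}

/// Blanket implementation: every `Read + Length` type is a `ReadChunk` for free.
impl<T: Read + Length> ReadChunk for T {}

// Only `Length` has to be written by hand for a concrete type.
impl Length for Cursor<Vec<u8>> {
    fn len(&self) -> u64 {
        self.get_ref().len() as u64
    }
}

/// Accepts any `ReadChunk` and returns its declared length plus the bytes actually read.
pub fn drain(mut chunk: impl ReadChunk) -> std::io::Result<(u64, Vec<u8>)> {
    let declared = chunk.len();
    let mut bytes = Vec::new();
    chunk.read_to_end(&mut bytes)?;
    Ok((declared, bytes))
}
```

With the blanket impl in place, callers never write `impl ReadChunk for ...` themselves; implementing the two component traits is enough.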


/// This is object to use if your file is already in memory.
/// The sliceable cursor is similar to std::io::Cursor, except that it makes it easy to create "cursor slices".
/// To achieve this, it uses Rc instead of shared references. Indeed reference fields are painfull
Contributor

I would personally suggest moving to Arc here rather than Rc across the board in the Parquet reader. I suggest Arc to "future proof" the code as Arc almost always is needed with async and multi-threading (as Rc doesn't implement Send)
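
A minimal sketch of why `Arc` matters once threads are involved. The function `sum_halves` is hypothetical and unrelated to the Parquet reader; it only shows a shared buffer being read from two threads.

```rust
use std::sync::Arc;
use std::thread;

/// Sums each half of a shared buffer on its own thread.
pub fn sum_halves(data: Vec<u8>) -> (u64, u64) {
    let shared: Arc<Vec<u8>> = Arc::new(data);
    let mid = shared.len() / 2;

    // Each thread gets its own handle; only the reference count changes, not the bytes.
    let left = Arc::clone(&shared);
    let right = Arc::clone(&shared);

    let left_sum = thread::spawn(move || left[..mid].iter().map(|&b| b as u64).sum::<u64>());
    let right_sum = thread::spawn(move || right[mid..].iter().map(|&b| b as u64).sum::<u64>());

    // Replacing `Arc` with `Rc` above would not compile: `Rc<Vec<u8>>` is not `Send`.
    (left_sum.join().unwrap(), right_sum.join().unwrap())
}
```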

Contributor

Should we implement this separately? That'd be my preference

Member

Yes maybe move this into a separate PR? given the current one is already big enough.

Contributor Author

@rdettai rdettai Oct 21, 2020

This is not a new feature but an adaptation of what we had before. Without SliceableCursor there is no way to read from RAM the way it was done with Cursor. Instead of directly implementing the new ChunkReader trait for Cursor and having the same problem of copying the data all over the place as before, I tweaked the Cursor a bit so that it shares its data between the Read slices.

@rdettai
Contributor Author

rdettai commented Oct 7, 2020

Thanks a lot for your time and comments!

using a more standard spelling of Chunk rather than Chunck

You mean "a more correct spelling" 😉

I wonder how important separating out get_read and get_read_seek is

Right, I was just looking at this. There might be some simplifications that could be implemented here.

I think we should begin moving the parquet reader towards using Arc

That is true, but there is a major problem with the way FileSource works in that case. We share a single file handle between instances (because that is the behavior of File::clone()), which prevents us from accessing the same file concurrently.
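
The shared-handle behavior can be observed with the standard library alone. The sketch below assumes the clone in question is `std::fs::File::try_clone`, whose documentation states that seeks affect both instances; the function name and the scratch file name are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Seeks through one handle, then reports the position and next byte seen by its clone.
pub fn cloned_handle_shares_position() -> io::Result<(u64, u8)> {
    // Hypothetical scratch file, for illustration only.
    let name = format!("shared_handle_demo_{}.bin", std::process::id());
    let path = std::env::temp_dir().join(name);
    File::create(&path)?.write_all(b"0123456789")?;

    let mut original = File::open(&path)?;
    let mut clone = original.try_clone()?;

    // Seeking through one handle...
    original.seek(SeekFrom::Start(7))?;

    // ...moves the other one too, because both share a single file position.
    let clone_pos = clone.stream_position()?;
    let mut byte = [0u8; 1];
    clone.read_exact(&mut byte)?;

    drop(original);
    drop(clone);
    std::fs::remove_file(&path)?;
    Ok((clone_pos, byte[0]))
}
```

Two readers built on such clones would interleave their seeks and reads, which is why sharing the handle across threads needs more than swapping Rc for Arc.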

Give me a bit of time, I will try to fix these. I'll come back to you if I have further doubts. Thanks again !!!

@rdettai rdettai marked this pull request as ready for review October 8, 2020 16:26
@rdettai
Contributor Author

rdettai commented Oct 8, 2020

@alamb I think I addressed most of your concerns. The only one that remains is the need to prepare for async, but I have dug into it a little and I think that tackling this properly will require work that is not really in the scope of this PR. There is more to it than converting Rc to Arc because of this shared handle to the file. Let's do that work in another PR associated with a sub-task of ARROW-9464, or simply with ARROW-9674. I will work on it next week, but until then we can validate and merge this one.

@alamb
Contributor

alamb commented Oct 8, 2020

The only one that remains is the need to prepare for async, but I have dug into it a little and I think that tackling this properly will require work that is not really in the scope of this PR

I agree

Contributor

@alamb alamb left a comment

I tried to compile this branch against an internal project we have (one that uses SerializedFileReader with something other than File), and I hit several compiler errors which I haven't fully worked through.

The first is with TryClone being private to the crate (comments inline below) though upon further reading of this PR it seems like the right thing for me to have done is to change the struct we are using to implement ChunkReader instead of TryClone, Length, etc.

Another error is the following:

error[E0599]: no method named `metadata` found for struct `parquet::file::serialized_reader::SerializedFileReader<R>` in the current scope
   --> delorean_parquet/src/metadata.rs:46:35
    |
46  |     let parquet_metadata = reader.metadata();
    |                                   ^^^^^^^^ private field, not a method

Sadly I ran out of time this morning to keep looking at this. I will try and spend some more time either later today or perhaps tomorrow.

/// get a serialy readeable slice of the current reader
/// This should fail if the slice exceeds the current bounds
fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
}

// ----------------------------------------------------------------------
Contributor

@alamb alamb Oct 9, 2020

When I tried to compile an in-house project with this branch I got the following error:


error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`
  --> delorean_parquet/src/lib.rs:13:28
   |
13 |     file::reader::{Length, TryClone},
   |                            ^^^^^^^^ no `TryClone` in `parquet::file::reader`

error[E0432]: unresolved import `delorean_parquet::parquet::file::reader::TryClone`

When I tried to use the copy in util::io I got:

error[E0603]: module `util` is private
  --> delorean_parquet/src/lib.rs:14:5
   |
14 |     util::io::{TryClone},
   |     ^^^^ private module
   |

When I used the import in serialized_reader I got a similar error:

error[E0603]: trait `TryClone` is private
  --> delorean_parquet/src/lib.rs:14:30
   |
14 |     file::serialized_reader::TryClone,
   |                              ^^^^^^^^ private trait
   |

Contributor

@alamb is this still an issue?

Contributor

I think it is (as the ParquetWriter uses TryClone) but I may not have tested with the latest changes from @rdettai

@alamb
Contributor

alamb commented Oct 10, 2020

Sorry I don't have more time to spend on this -- I am still struggling to get my existing project code to compile against this branch and this PR is pretty massive to digest.

One other thing I found this morning is that the TryClone trait is still used in ParquetWriter, so I do think it needs to be publicly exported somehow (or we need to refactor the writer).

I have also been trying to figure out if the ParquetReader trait is still needed after this refactoring.

So I guess all in all my conclusion is that the ideas behind this PR are good but:

  1. It is hard to review given its size
  2. There will likely be a fairly substantial backwards incompatible change for anyone using the ParquetReader in their project

@nevi-me
Contributor

nevi-me commented Oct 10, 2020

I'll also only be able to review this in the next 2 week(end)s, and I'd like to go through it before it's merged.

I'm also going to reach out to a few active crates that depend on Parquet, so I can solicit their feedback on these changes.
I anticipate a lot of changes and improvements for the 3.0.0 release, so if we can manage that without causing a lot of breakage, I think we should go for it.

@alamb
Contributor

alamb commented Oct 11, 2020

I should probably make clear that anyone using the SerializedFileReader with File will likely have no issues (as @rdettai has done a great job of keeping that level of the interface consistent). Any project that uses a custom input source (as we did in my internal project) by implementing ParquetReader and associated traits may have more to change.

I actually think the changes in this PR will make such custom input sources easier to write in the future, but existing code will have to be rejiggered.

@rdettai
Contributor Author

rdettai commented Oct 13, 2020

I'm open to moving traits such as ParquetReader back to reader.rs or re-exposing them with pub use crate::. But I have the feeling these traits are very specific to the implementation of util::io::FileSource, which is private because it is mainly a slicing/buffering wrapper around File (hence the name ;-)). This wrapper makes sense because File is cheap to clone and needs to be buffered, but this will generally not be the case for other implementations of ParquetReader. What about your use case, @alamb? Do you think it is worth having ChunkReader implemented for T: ParquetReader rather than File? This would be more flexible, maybe a little too much...

It is a good observation that TryClone is also used in the writer. I didn't want to move things around in the writer because it would have made the PR even more massive and also because I've been told that there was an active development effort going on over there :-). Similar changes might apply.

@alamb
Contributor

alamb commented Oct 13, 2020

@rdettai --

This wrapper makes sense because File is cheap to clone and needs to be buffered, but this will generally not be the case for other implementations of ParquetReader.

Yes, actually that was our experience (when we used a buffered implementation, we found that the underlying (large) buffer got copied around a lot).

I think in general, the cleaner thing (and my preference) would be to simply port my code to use ChunkReader (I think it will end up being cleaner and likely more performant).

Do you think it is worth having ChunkReader implemented for T: ParquetReader rather than File?

The upside is that it would be somewhat more backwards compatible, but unless other reviewers feel strongly I personally think we should just do the breaking change, merge this PR, (maybe even remove ParquetReader too) and I'll rewrite our project in terms of ChunkReader

@alamb
Contributor

alamb commented Oct 13, 2020

I am but one opinion -- I wonder if @nevi-me or @sunchao have any thoughts / opinions about where this PR is heading

@rdettai
Contributor Author

rdettai commented Oct 14, 2020

An argument in favor of having ChunkReader implemented for T: ParquetReader rather than File:
-> it makes it possible to write wrappers around File that you can then pass to SerializedFileReader. For instance, at some point I wanted to debug the number of actual read operations on the File object, so I wrote a tiny wrapper that incremented a counter at each read() call...
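
A wrapper of the kind described here could look roughly like the sketch below. `CountingReader` is a hypothetical name, and the sketch only covers `Read`; passing it to SerializedFileReader would also require whatever other traits the reader expects.

```rust
use std::io::{self, Read};

/// Wraps any reader and counts how many times `read` is called on it.
pub struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read> CountingReader<R> {
    pub fn new(inner: R) -> Self {
        CountingReader { inner, reads: 0 }
    }

    /// Number of `read` calls forwarded to the wrapped reader so far.
    pub fn reads(&self) -> usize {
        self.reads
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.reads += 1;
        self.inner.read(buf)
    }
}
```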

@rdettai rdettai force-pushed the ARROW-10135-parquet-file-reader branch from 6997952 to a084211 Compare October 16, 2020 07:52
@nevi-me
Contributor

nevi-me commented Oct 17, 2020

I am but one opinion -- I wonder if @nevi-me or @sunchao have any thoughts / opinions about where this PR is heading

I'm fine in principle with the changes and the direction of the PR. I will very likely be working more on the non-Arrow side of Parquet, so I expect that I might be making a lot of changes on top of this.

There seems to be some demand for supporting sources and sinks other than files, and as this helps with that, I'm in favor of getting it merged.

@rdettai are there still more changes that you intend on making, and @alamb are all your queries and concerns addressed? Thanks for the detailed review.

@alamb
Contributor

alamb commented Oct 18, 2020

@nevi-me -- I think this is good enough to merge; Once it is merged, I'll try and port my internal project asap (which has custom readers and writers) and make a PR to do any touchups needed (like re-exporting TryClone if need be)

@sunchao
Member

sunchao commented Oct 19, 2020

From the description, the changes proposed here make sense to me as well, but it seems the PR itself is messed up and contains many unrelated changes.

rdettai and others added 5 commits October 20, 2020 00:46
also removed seek from ChunkReader and fixed typo on "chunck" word
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@kszucs kszucs force-pushed the ARROW-10135-parquet-file-reader branch from a084211 to 5f9ba42 Compare October 19, 2020 22:46
@rdettai
Contributor Author

rdettai commented Oct 20, 2020

@rdettai are there still more changes that you intend on making, and @alamb are all your queries and concerns addressed? Thanks for the detailed review.

@nevi-me Depends on whether we want ParquetReader to remain public or not. If not, I think the PR is fine, otherwise, I can bring it back into parquet::file::reader.

@sunchao I tried to restrain myself on this PR 😄. Honestly, I had to move quite a lot of things around because this touches a core API and things were very "monolithic". There are two points where I'm straying a little from the main concern:

  • I added a Seek implementation to the FileSource that ended up not being useful. I am removing it right now.
  • The typo fix in array_reader, but I'm sure you can forgive me that one :-)

@sunchao
Member

sunchao commented Oct 20, 2020

Thanks @rdettai . I'll take a look at this PR today.

/// The ChunkReader trait generates readers of chunks of a source.
/// For a file system reader, each chunk might contain a clone of File bounded on a given range.
/// For an object store reader, each read can be mapped to a range request.
pub trait ChunkReader: Length {
Member

nit: maybe ChunkRead?

Contributor Author

I used Reader because it does not implement Read but generates things that implement Read. But I admit that I'm not 100% happy with ChunkReader, and it would be nice to have something more explicit as this is a very public part of the API.
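
To make the "generates things that implement Read" distinction concrete, here is a hedged sketch built around the get_read(&self, start: u64, length: usize) signature quoted in this review. The traits are local stand-ins using std::io::Result, the `len` method on `Length` is assumed, and `InMemorySource` and `read_footer_tail` are hypothetical. The tail layout (a 4-byte little-endian metadata length followed by the 4-byte magic) follows the Parquet file format.

```rust
use std::io::{self, Cursor, Read};

/// Local stand-in for the `Length` trait in the diff.
pub trait Length {
    fn len(&self) -> u64;
}

/// Local stand-in: not itself a `Read`, but a factory of bounded readers.
pub trait ChunkReader: Length {
    type T: Read;
    fn get_read(&self, start: u64, length: usize) -> io::Result<Self::T>;
}

/// Hypothetical in-memory source used only for this illustration.
pub struct InMemorySource(pub Vec<u8>);

fn out_of_bounds() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidInput, "chunk exceeds source bounds")
}

impl Length for InMemorySource {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }
}

impl ChunkReader for InMemorySource {
    type T = Cursor<Vec<u8>>;

    fn get_read(&self, start: u64, length: usize) -> io::Result<Self::T> {
        let total = self.0.len() as u64;
        // Fail when the requested chunk does not fit inside the source.
        let end = start
            .checked_add(length as u64)
            .filter(|&e| e <= total)
            .ok_or_else(out_of_bounds)?;
        // Copies the range for simplicity; a shared-buffer cursor would avoid this.
        Ok(Cursor::new(self.0[start as usize..end as usize].to_vec()))
    }
}

/// Reads the last 8 bytes: little-endian metadata length, then the magic bytes.
pub fn read_footer_tail<R: ChunkReader>(source: &R) -> io::Result<(u32, [u8; 4])> {
    let len = source.len();
    if len < 8 {
        return Err(out_of_bounds());
    }
    let mut tail = [0u8; 8];
    source.get_read(len - 8, 8)?.read_exact(&mut tail)?;
    let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    Ok((metadata_len, [tail[4], tail[5], tail[6], tail[7]]))
}
```

The source stays immutable (`&self`) while each returned reader carries its own position, which is what lets one source serve several column chunks.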

/// Length should return the total number of bytes in the input source.
/// It's mainly used to read the metadata, which is at the end of the source.
#[allow(clippy::len_without_is_empty)]
pub trait Length {
Member

maybe io.rs is more suitable for this trait?

Contributor Author

I think not, because it is also used by ChunkReader and io.rs contains internal implementations. But I admit that I'm still not very clear on how modules should be structured in Rust 😄


@rdettai
Contributor Author

rdettai commented Oct 21, 2020

I'm having second thoughts about the ChunkReader interface. It has a length, but what we really want is the capability to "read from end" in order to get the parquet footer. So I'm thinking about removing the Length trait and giving get_read the capability to read from the end, for instance by turning start: u64 into an enum { FromStart(u64), FromEnd }. This is particularly interesting when reading from a remote store where getting the size is expensive. What do you think?

@alamb
Contributor

alamb commented Oct 21, 2020

I'm having second thoughts about the ChunkReader interface.

I personally think the ChunkReader interface is good enough as is and this PR has been hanging out for quite a bit.

By adding more sophistication (FromStart, FromEnd) you may in theory save an S3 (or other object store) metadata request to get the length, but I suspect most applications will already have the length metadata from other sources anyway (e.g. because they had to list the objects for some other reason), so acquiring the length may not be as expensive as it first appears.

@rdettai
Contributor Author

rdettai commented Oct 21, 2020

I agree that this PR is hanging, but as this is an API change, I guess it's better to think things through before moving forward! 😄 This should maybe have been prepared in a design document rather than in the PR... 😅

It's true that most of the time you will have the length around from another source. But shouldn't we prefer a trait that contains exactly the minimum amount of information we need to make the FileReader implementation work?
Proposal:

pub enum ChunkMode {
    FromStart(u64),
    FromEnd(u64),
}

pub trait ChunkReader {
    type T: Read + Length;
    fn get_read(&self, start: ChunkMode, length: usize) -> Result<Self::T>;
}

I can have the implementation ready by the end of the day!
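
As a rough illustration of how a source that does know its size might interpret the proposed enum, here is a sketch. It assumes `FromEnd(n)` means "start n bytes before the end", which the proposal above does not spell out, and `resolve_range` is a hypothetical helper, not part of the PR.

```rust
use std::io;
use std::ops::Range;

/// The enum from the proposal above.
pub enum ChunkMode {
    FromStart(u64),
    FromEnd(u64),
}

/// Resolves a mode and a length into an absolute byte range of the source.
/// Assumption: `FromEnd(n)` starts `n` bytes before the end of the source.
pub fn resolve_range(mode: ChunkMode, length: usize, total_len: u64) -> io::Result<Range<u64>> {
    let start = match mode {
        ChunkMode::FromStart(offset) => Some(offset),
        ChunkMode::FromEnd(offset) => total_len.checked_sub(offset),
    };
    let end = start.and_then(|s| s.checked_add(length as u64));
    match (start, end) {
        // Accept only ranges that lie entirely inside the source.
        (Some(s), Some(e)) if e <= total_len => Ok(s..e),
        _ => Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "chunk exceeds source bounds",
        )),
    }
}
```

A local file or in-memory buffer can resolve the range this way; a remote store would presumably translate `FromEnd` into a request that does not need the total length at all, which is the point of the proposal.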

@rdettai
Contributor Author

rdettai commented Oct 22, 2020

Hi! I did some experimentation with this new interface (with ChunkMode). It works but is not flawless. You are right @alamb , we should move on with this PR the way it is currently implemented (with get_read(&self, start: u64, length: usize)), and we'll discuss the other possibility in a dedicated thread.
@sunchao I have responded to all your comments. If you are okay with my answers, I think we can merge this.

Member

@sunchao sunchao left a comment

LGTM from my side. Thanks @rdettai !

@alamb
Contributor

alamb commented Oct 26, 2020

FYI I have filed https://issues.apache.org/jira/browse/ARROW-10390 for the issue with TryClone not being public. I'll try and whip up a PR to get it fixed.

nevi-me pushed a commit that referenced this pull request Oct 26, 2020
…arquet writers

As of #8300, it no longer appears possible to implement a `ParquetWriter` to provide a custom writer because it is not possible to implement the trait as `TryClone` is not publicly exported.

This PR publicly exports that trait and adds an end-to-end test demonstrating that it is possible to create a custom writer

Here is what happens if you try and use `TryClone` on master

```
error[E0603]: module `util` is private
  --> delorean_parquet/src/writer.rs:11:5
   |
11 |     util::io::TryClone,
   |     ^^^^ private module
   |
note: the module `util` is defined here
  --> /Users/alamb/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/7155cd5/rust/parquet/src/lib.rs:39:1
   |
39 | mod util;
   | ^^^^^^^^^

```

Closes #8528 from alamb/alamb/ARROW-10390-custom-parquet-writer

Authored-by: alamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
jorgecarleitao pushed a commit that referenced this pull request Oct 31, 2020
…ileSource

## Rationale
7155cd5 / #8300 reworked how the parquet reader traits were implemented to be in terms of a `ChunkReader` trait (for the better, in my opinion).

That commit includes two helper classes, `SliceableCursor` and `FileSource`, which implement `ChunkReader` for a `Cursor` like thing and `File`s, respectively.

My project instantiates a `SerializedFileWriter` from the parquet crate with `struct`s that wrap  `File` and `Cursor` and thus I would like to re-use the logic in `SliceableCursor` and `FileSource` without having to copy/paste them.

## Changes

1. Publicly export `SliceableCursor` and `FileSource`
2. Implement `Seek` for SliceableCursor
3. Implement `Debug` for both `SliceableCursor` and `FileSource`

Closes #8534 from alamb/alamb/ARROW-10396-expose-helpers

Authored-by: alamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…arquet writers

alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…ileSource

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…rces

https://issues.apache.org/jira/browse/ARROW-10135

Closes apache#8300 from rdettai/ARROW-10135-parquet-file-reader

Authored-by: rdettai <rdettai@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…arquet writers

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…ileSource

5 participants