Documentation is profoundly unhelpful to Rubyists new to Arrow #49352

@geeksam

Describe the bug, including details regarding any error messages, version, and platform.

I apologize in advance for any snark here. I'll try to filter and/or edit it out, but I'm frustrated, and experience suggests that I won't be completely successful. I appreciate the effort involved in publishing and maintaining this code—I just wish it were more accessible, and I'm hoping that this "stream of consciousness" report may help build some empathy for newbies.

So.

I'm working on a project that involves using a SAX parser to transform data and store it in Parquet. I have 20 (not a typo: two full decades) years of experience writing Ruby, 18 of those full time, but I'm completely new to both Parquet and Arrow as of yesterday.

Yesterday, I searched rubygems.org for 'parquet' and found two gems: one named parquet and one named red-parquet. I decided to start with the parquet gem, and had TDD'd a working example in about three hours (including time spent extracting the SAX parser from its monolith and removing the extraneous bits). However, that gem is pre-1.0 and the repo shows signs of neglect, so I decided to try the red-parquet gem instead.

Two hours later, I've managed to get the gem to build (the README mentions rubygems-requirements-system, which I've never heard of, but brew install apache-arrow-glib did the job), and am trying to piece together something that writes to a file, starting with a single field.

The README for red-arrow has no useful examples. The README for red-arrow-format contains enough information for me to tell that it's not what I want, so that's actually helpful. The README for red-arrow-parquet has a few examples of operations on data that seem like I might want to check them out later, but first I need to be able to write the thing...

Eventually, I notice an examples directory in red-arrow, and I open https://github.com/apache/arrow/blob/main/ruby/red-arrow/example/write-file.rb. The bit with fields and schema isn't especially Rubyish, but it seems straightforward enough for now (I figure I'll circle back around and try to figure out structs—not to be confused with Ruby's native Struct class—once I'm able to write some strings and integers).

Reading down, I skip over the two nested blocks, and see... arrays for each column? And then an array called columns that contains a lot of typed containers. So, given that my source data is in rows, it looks like I may need to manually transpose it for this API? Well, that's a problem for Future Sam. This code doesn't seem to use many Ruby idioms—but that's an editorial complaint, not a functional one, so I keep skimming down. Next I see a RecordBatch that gets initialized with a schema, 4, and the columns array.

Wait, 4? What does that magic number signify? NO IDEA. My ADHD brain simply MUST know, so...

I search for documentation, end up at https://rubydoc.info/gems/red-arrow/Arrow/RecordBatch, and see that the initializer has... no documentation. And no viewable source. Cool cool cool.
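For readers following along, here is a minimal sketch in plain Ruby (no Arrow calls at all) of the row-to-column transposition that the example seems to require. It rests on an assumption that none of the documentation found so far confirms: that the 4 is the number of rows in the batch, which would be consistent with a columns array whose members each hold four values. The field names and sample rows are hypothetical.

```ruby
# Hypothetical row-oriented source data, e.g. records emitted by a SAX parser.
rows = [
  { name: "alpha", count: 1 },
  { name: "beta",  count: 2 },
  { name: "gamma", count: 4 },
  { name: "delta", count: 8 },
]

# Hypothetical field order; in an Arrow program this would mirror the schema.
field_names = [:name, :count]

# Transpose: build one Ruby array per field, each holding that field's
# value from every row, in row order.
columns = field_names.map { |field| rows.map { |row| row.fetch(field) } }

# Assumption: the literal 4 in the example corresponds to this value,
# the number of rows shared by every column.
n_rows = rows.size

puts columns.inspect
puts n_rows
```

If that assumption holds, a caller would not hard-code the count but derive it from the data, and each plain array would still need to be wrapped in one of the typed containers the example uses before being handed to the RecordBatch.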

I try searching the web for usage examples. I don't find any. What I do find is a gem called parqueteur that's... actually, hang on a minute: it's rather lovely. The examples are clear, they showcase an API that was clearly designed by a Rubyist, it has some features that I'd probably end up writing if I went the DIY route, and... HOLY CATS, THERE ARE ACTUAL COMMENTS IN THE CODE EXAMPLES. :fainting-goat:

The narrative portion ends here, because I've been burning glucose at a furious rate, and I. am. done.

--

In the strictest sense, these are accessibility/ergonomics/UX issues, not literal bugs. If one defines a "bug" as "unexpected behavior of the code at runtime," I can't possibly have experienced bugs with this project, because the onboarding experience was so unnecessarily confusing that I never even achieved runtime. But the sparse documentation is absolutely a barrier to adoption, and frankly, for my own projects, I'd consider that a bug. Y'all may not, and that's fair; recategorize or close this issue as you see fit.

I'd offer to contribute better examples, but I'm standing at the bottom end of a steep learning curve, staring up, and I still have a lot more unknown unknowns than anything else. The limited documentation and examples that do exist in these projects were clearly written by someone(s) suffering from the curse of knowledge [1, 2], and offer me no purchase. I might end up taking a tour through the parqueteur codebase, and that might help me understand your object model—if so, I'd be happy to circle back around and see what I can add to make things at least slightly less painful for the next person to try this out.

But if nothing else, I hope this can at least provide a gentle reminder that survivorship bias is a thing. Going with the original example that led to the coining of the term, think of me as a plane that didn't even get a chance to return from combat, because it crashed at the end of the runway on launch. :)

Component(s)

Ruby
