Big refactoring to simplify code and implement more parts of the spec #26

Merged
7 commits merged into master from jq/compression on Oct 1, 2020

@quinnj quinnj commented Sep 27, 2020

In particular this PR:

  • Changes the flow of writing so that input columns are converted to
    their "arrow" equivalent, most of the time lazily, so that the actual
    writing is much simpler and just deals with each array type
  • Introduces an ArrowTypes module that defines the ArrowType trait,
    which types can overload to signal what kind of arrow type they should
    be converted to. This isn't fully fleshed out yet, as we need to
    figure out the actual requirements for the different arrow types, but
    it's a start
  • Support reading/writing compressed arrow buffers: automatically when
    reading, and via keyword arg when writing (compress=:lz4 or
    compress=:zstd)
  • support nested dict encoding (fixes #15, "Support writing dict encodings for nested columns")
  • fixes #24 ("Unable to round-trip a DataFrame"); not exactly sure what the issue was, but it doesn't
    happen on this branch
  • reorganizes tests so that for each kind of "table", we do IPC test,
    compressed buffers test, and file test
  • support for nesting "special" types like Char/Symbol; previously they were only allowed as top-level columns, but now they can appear in NamedTuple, vector of vectors, etc.
  • Fix the Map type; I mistakenly thought Map meant that each element of the column was a Pair, but it actually means each element is a Dict. So it's more similar to the List type than Struct; this fixes the reading/writing for this type.
  • ensure that array types which need it hold a reference to their unsafe_wrap-ed bytes; these bytes may come from the original arrow blob or from an uncompressed buffer
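
For reference, a minimal sketch of the compressed round trip described above. It assumes the `compress` keyword on `Arrow.write` behaves as stated in this description and that `Arrow.Table` accepts an `IO`; the sample table and its column names are made up for illustration.

```julia
using Arrow

# Hypothetical sample table; a NamedTuple of vectors is a valid Tables.jl source.
tbl = (id = [1, 2, 3], name = ["a", "b", "c"])

io = IOBuffer()
# Compression is opt-in when writing: pass compress=:lz4 or compress=:zstd.
Arrow.write(io, tbl; compress = :lz4)

# No keyword is needed when reading; compressed buffers are handled automatically.
seekstart(io)
t = Arrow.Table(io)

# Columns are available as properties of the table.
ids = collect(t.id)
names = collect(t.name)
```

Swapping `:lz4` for `:zstd` should be the only change needed to try the other codec.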


codecov bot commented Oct 1, 2020

Codecov Report

Merging #26 into master will increase coverage by 1.63%.
The diff coverage is 82.12%.

@@            Coverage Diff             @@
##           master      #26      +/-   ##
==========================================
+ Coverage   79.01%   80.64%   +1.63%     
==========================================
  Files          13       14       +1     
  Lines        2063     2557     +494     
==========================================
+ Hits         1630     2062     +432     
- Misses        433      495      +62     

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/FlatBuffers/FlatBuffers.jl | 66.25% <0.00%> (-3.88%) | ⬇️ |
| src/metadata/Schema.jl | 81.25% <0.00%> (+3.70%) | ⬆️ |
| src/Arrow.jl | 28.57% <23.07%> (-71.43%) | ⬇️ |
| src/utils.jl | 73.89% <71.12%> (+2.88%) | ⬆️ |
| src/arraytypes.jl | 76.98% <76.03%> (+1.55%) | ⬆️ |
| src/arrowtypes.jl | 80.00% <80.00%> (ø) | |
| src/metadata/Message.jl | 74.56% <80.00%> (+14.03%) | ⬆️ |
| src/write.jl | 94.23% <93.75%> (+3.61%) | ⬆️ |
| src/table.jl | 93.41% <95.36%> (+0.24%) | ⬆️ |
| src/eltypes.jl | 83.91% <98.80%> (+6.23%) | ⬆️ |

... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cb39e96...3cb988f.

@quinnj quinnj merged commit 1a1d3c5 into master Oct 1, 2020
@quinnj quinnj deleted the jq/compression branch October 1, 2020 05:03
quinnj added a commit that referenced this pull request Oct 3, 2020